
Cancer Treatment

Business Understanding

Once sequenced, a cancer tumor can have thousands of genetic mutations. The challenge is distinguishing the mutations that contribute to tumor growth (drivers) from the neutral mutations (passengers). Currently this interpretation of genetic mutations is done manually: a clinical pathologist reviews and classifies every single genetic mutation based on evidence from text-based clinical literature, which is a very time-consuming task. For this study we use the MSKCC dataset, an expert-annotated knowledge base in which researchers and oncologists have manually annotated thousands of mutations. We need to develop a machine learning algorithm that, using this knowledge base as a baseline, automatically classifies genetic variations.


Data Credits: Kaggle - Memorial Sloan Kettering Cancer Center (MSKCC)

Domain Review

Business Objectives and Constraints

  • Classify the given genetic variations/mutations based on evidence from text-based clinical literature.
  • The model should help reduce the man-hours required to solve the problem.
  • Output a probability for the predicted class so that the pathologist can gauge confidence in the classification.
  • The cost of a misclassification is very high.
  • Interpretability is really important.
  • There are no strict latency constraints.

Machine Learning Problem Statement
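Given a gene, a variation and the associated clinical text, the task is to predict which of the 9 classes the mutation belongs to and to output a probability for each class. Throughout this notebook models are compared with multi-class log loss (the quantity computed by sklearn's log_loss), which for \(N\) data points and \(C = 9\) classes is

\[ \text{log loss} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{C} y_{ij} \log p_{ij} \]

where \(y_{ij}\) is 1 if data point \(i\) belongs to class \(j\) (0 otherwise) and \(p_{ij}\) is the predicted probability of class \(j\) for data point \(i\). Lower is better, and well-calibrated probabilities are rewarded, which matches the constraint that the pathologist should be able to trust the predicted probabilities.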

We will start by importing the required libraries and defining some constants.

## importing libraries
from utils.base import *
css()

## Defining constants
title = "Cancer Treatment"
raw_training_text_file_path = f"data/{title}/01_raw/training_text.csv"
raw_training_variants_file_path = f"data/{title}/01_raw/training_variants.csv"
raw_training_text_dataprep_report_path = f"Assets/reports/{title}/raw_training_text_dataprep.html"
raw_training_variants_dataprep_report_path = f"Assets/reports/{title}/raw_training_variants_dataprep.html"

Data Overview

There are two datasets available as raw files; we will load these into df_variants and df_text.

Glimpse of df_variants

df_variants = pd.read_csv(raw_training_variants_file_path)
glimpse(df_variants)
create_report(df_variants, title=title, progress=False).save(raw_training_variants_dataprep_report_path)
Shape:  (3321, 4)

Info          ID     Gene      Variation  Class
dtype         int64  object    object     int64
# of Nas      0      0         0          0
% of Na       0.0    0.0       0.0        0.0
# of uniques  3321   264       2996       9
% of unique   100.0  7.949413  90.213791  0.271003

Count of dtypes
int64   2
object  2

Stats  ID           Class
count  3321.000000  3321.000000
mean   1660.000000  4.365854
std    958.834449   2.309781
min    0.000000     1.000000
25%    830.000000   2.000000
50%    1660.000000  4.000000
75%    2490.000000  7.000000
max    3320.000000  9.000000

Sample  ID    Gene    Variation        Class
385     385   TP53    R249W            4
443     443   TP53    G244R            4
2488    2488  BRCA1   W1782C           5
644     644   CDKN2A  E69G             4
2084    2084  MYD88   L265P            7
329     329   ROS1    MSN-ROS1 Fusion  2
1604    1604  VHL     L188Q            4
1113    1113  FANCA   Deletion         1
2543    2543  BRCA1   C27A             4
1628    1628  VHL     R167Q            4
Report has been saved to Assets/reports/Cancer Treatment/raw_training_variants_dataprep.html!

Generated Report

Glimpse of df_text

df_text = pd.read_csv(raw_training_text_file_path, sep=r"\|\|", engine="python", names=["ID","TEXT"], skiprows=1)
glimpse(df_text)
create_report(df_text, title=title, progress=False).save(raw_training_text_dataprep_report_path)
Shape:  (3321, 2)

Info          ID     TEXT
dtype         int64  object
# of Nas      0      5
% of Na       0.0    0.150557
# of uniques  3321   1920
% of unique   100.0  57.813911

Count of dtypes
int64   1
object  1

Stats  ID
count  3321.000000
mean   1660.000000
std    958.834449
min    0.000000
25%    830.000000
50%    1660.000000
75%    2490.000000
max    3320.000000

Sample  ID    TEXT
801     801   Recent efforts to comprehensively characterize...
1945    1945  Mycosis fungoides and Sézary syndrome are prim...
1526    1526  Anaplastic lymphoma kinase (ALK) tyrosine kina...
2735    2735  Abstract' ' ' ' Mutations in the B-Raf gene ha...
1854    1854  The karyotypic chaos exhibited by human epithe...
2786    2786  Mutation screening of the breast and ovarian c...
665     665   Inherited mutations affecting the INK4a/ARF lo...
2234    2234  The tumor suppressor gene PTEN is frequently m...
430     430   The human p53 gene encodes a nuclear protein t...
1278    1278  The three-dimensional structure of the complex...
Report has been saved to Assets/reports/Cancer Treatment/raw_training_text_dataprep.html!

Generated Report

Preprocessing of text

# loading stop words from the nltk library
stop_words = set(stopwords.words('english'))

def nlp_preprocessing(df_text, total_text, index, column):
    if type(total_text) is not int:
        string = ""
        # replace every special char with a space
        total_text = re.sub(r'[^a-zA-Z0-9\n]', ' ', total_text)
        # replace multiple spaces with a single space
        total_text = re.sub(r'\s+', ' ', total_text)
        # convert all the chars to lower-case
        total_text = total_text.lower()

        for word in total_text.split():
            # if the word is not a stop word then retain it
            if word not in stop_words:
                string += word + " "

        # write the cleaned text back into the dataframe
        df_text.loc[index, column] = string
# text processing stage
start_time = time.perf_counter()
for index, row in df_text.iterrows():
    if type(row['TEXT']) is str:
        nlp_preprocessing(df_text, row['TEXT'], index, 'TEXT')
    else:
        print("there is no text description for id:", index)
print('Time taken for preprocessing the text :', time.perf_counter() - start_time, "seconds")
there is no text description for id: 1109
there is no text description for id: 1277
there is no text description for id: 1407
there is no text description for id: 1639
there is no text description for id: 2755
Time taken for preprocessing the text : 24.476304084999995 seconds

# merging both gene_variations and text data based on ID
result = pd.merge(df_variants, df_text,on='ID', how='left')
result.head()
   ID  Gene    Variation             Class  TEXT
0  0   FAM58A  Truncating Mutations  1      cyclin dependent kinases cdks regulate variety...
1  1   CBL     W802*                 2      abstract background non small cell lung cancer...
2  2   CBL     Q249E                 2      abstract background non small cell lung cancer...
3  3   CBL     N454D                 3      recent evidence demonstrated acquired uniparen...
4  4   CBL     L399V                 4      oncogenic mutations monomeric casitas b lineag...
result[result.isnull().any(axis=1)]
      ID    Gene    Variation             Class  TEXT
1109  1109  FANCA   S1088F                1      NaN
1277  1277  ARID5B  Truncating Mutations  1      NaN
1407  1407  FGFR3   K508M                 6      NaN
1639  1639  FLT1    Amplification         6      NaN
2755  2755  BRAF    G596C                 7      NaN
result.loc[result['TEXT'].isnull(),'TEXT'] = result['Gene'] +' '+result['Variation']
result[result['ID'].isin([1109,1277,1407,1639,2755])]
      ID    Gene    Variation             Class  TEXT
1109  1109  FANCA   S1088F                1      FANCA S1088F
1277  1277  ARID5B  Truncating Mutations  1      ARID5B Truncating Mutations
1407  1407  FGFR3   K508M                 6      FGFR3 K508M
1639  1639  FLT1    Amplification         6      FLT1 Amplification
2755  2755  BRAF    G596C                 7      BRAF G596C

Test, Train and Cross Validation Split

Splitting the dataset randomly into three parts (train, cross validation and test) with 64%, 16% and 20% of the data respectively.

y_true = result['Class'].values
result.Gene      = result.Gene.str.replace(r'\s+', '_', regex=True)
result.Variation = result.Variation.str.replace(r'\s+', '_', regex=True)

# split by maintaining the same distribution of the output variable 'y_true' [stratify=y_true]
X_train, test_df, y_train, y_test = train_test_split(result, y_true, stratify=y_true, test_size=0.2)
# split by maintaining the same distribution of the output variable 'y_train' [stratify=y_train]
train_df, cv_df, y_train, y_cv = train_test_split(X_train, y_train, stratify=y_train, test_size=0.2)

We split the data into train, cross validation and test sets, preserving the class distribution of the original data set.

print('Number of data points in train data:', train_df.shape[0])
print('Number of data points in test data:', test_df.shape[0])
print('Number of data points in cross validation data:', cv_df.shape[0])
Number of data points in train data: 2124
Number of data points in test data: 665
Number of data points in cross validation data: 532

Distribution of \(y_i\)'s in the Train, Test and Cross Validation datasets

train_class_distribution = train_df['Class'].value_counts().sort_index()
cv_class_distribution = cv_df['Class'].value_counts().sort_index()
test_class_distribution = test_df['Class'].value_counts().sort_index()

train_class_distribution_norm = train_class_distribution/train_class_distribution.sum()
cv_class_distribution_norm = cv_class_distribution/cv_class_distribution.sum()
test_class_distribution_norm = test_class_distribution/test_class_distribution.sum()

norm_df = pd.concat([train_class_distribution_norm, cv_class_distribution_norm, test_class_distribution_norm], axis=1)
norm_df.columns = ['train','cv','test']
ax = norm_df.plot(kind='bar', figsize=(16,8))

The class distribution is even and consistent across the train, cross validation and test sets.

Random Model Prediction

Now we will create a random model to serve as a dumb baseline, against which we can measure the performance of our models.

test_data_len = test_df.shape[0]
cv_data_len = cv_df.shape[0]

# predict random class probabilities (normalised to sum to 1) for every cross validation point
cv_predicted_y = np.zeros((cv_data_len, 9))
for i in range(cv_data_len):
    rand_probs = np.random.rand(1, 9)
    cv_predicted_y[i] = (rand_probs / rand_probs.sum())[0]
print("Log loss on Cross Validation Data using Random Model", log_loss(y_cv, cv_predicted_y, eps=1e-15))

# predict random class probabilities (normalised to sum to 1) for every test point
test_predicted_y = np.zeros((test_data_len, 9))
for i in range(test_data_len):
    rand_probs = np.random.rand(1, 9)
    test_predicted_y[i] = (rand_probs / rand_probs.sum())[0]
print("Log loss on Test Data using Random Model", log_loss(y_test, test_predicted_y, eps=1e-15))

predicted_y = np.argmax(test_predicted_y, axis=1)
plot_confusion_matrix(y_test, predicted_y + 1)
Log loss on Cross Validation Data using Random Model 2.503933530991029
Log loss on Test Data using Random Model 2.582919708614705

Univariate Analysis

Process Description

High-Cardinality Categorical Features

  1. How many categories are there and how are they distributed?

  2. How do we encode a high-cardinality categorical feature?

    • One hot encoding (higher number of dimensions)
    • Response coding (lower number of dimensions) with Laplace smoothing (see the sketch after this list)
      • Laplace smoothing is used when calculating the probability that a feature value belongs to a particular class
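The response-coded features used later come from get_response_coded_feature, a helper defined in utils.base that is not shown in this notebook. A minimal sketch of the idea, assuming the helper follows the usual recipe (the function name, signature and parameters below are illustrative, not the actual implementation):

import numpy as np
import pandas as pd

def response_code(train_df, other_df, feature, target, alpha=10, n_classes=9):
    # encode each feature value as the vector of P(class | feature value),
    # Laplace-smoothed so rare/unseen values fall back towards a uniform distribution
    counts = pd.crosstab(train_df[feature], train_df[target])
    counts = counts.reindex(columns=range(1, n_classes + 1), fill_value=0)

    # Laplace smoothing: (count + alpha) / (total + alpha * n_classes)
    probs = (counts + alpha).div(counts.sum(axis=1) + alpha * n_classes, axis=0)

    # values never seen in the train data get the uniform vector 1/n_classes
    uniform = np.full(n_classes, 1.0 / n_classes)
    return np.vstack([
        probs.loc[v].values if v in probs.index else uniform
        for v in other_df[feature]
    ])

Used this way, encoding 'Gene' or 'Variation' against 'Class' with alpha = 10 yields a (number of rows, 9) matrix for each of the train, CV and test frames, matching the shapes reported below.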

Text-Based Features

  1. How many unique words are present in the train data?
  2. How are word frequencies distributed?
  3. How do we encode text features as numbers?

Feature Impact Analysis

  1. How good is a feature at predicting \(y_i\)?

    • Build an ML model using just the feature in question, e.g. a logistic regression model that predicts \(y_i\) from that feature alone, and compare its metrics with the Random Model prediction.
  2. Is the feature stable across all the data sets (train, cross validation, test)?

    • Check the distribution across the train and cross validation sets
    • Check the cross validation and test metrics from the impact analysis (they should be comparable to the train metrics)
    • Coverage: how many data points in the test and CV datasets are covered by the unique feature values in the train dataset? (See the sketch below.)
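The coverage check is computed inline later in the impact analysis sections; a minimal helper capturing the same logic, shown here only for illustration:

def coverage(train_df, other_df, feature):
    # fraction of rows in other_df whose feature value also appears in train_df
    seen = set(train_df[feature])
    covered = other_df[feature].isin(seen).sum()
    return covered, covered / other_df.shape[0]

# e.g. coverage(train_df, test_df, 'Gene') gives 645 out of 665 (~97%) on the split used below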

EDA

unique_genes = train_df['Gene'].value_counts()
s = sum(unique_genes.values)
h = unique_genes.values/s
c = np.cumsum(h)

fig,ax = plt.subplots(1,2, figsize=(16,8))
ax[0].plot(h)
ax[0].set_xlabel('Index of a Gene')
ax[0].set_ylabel('Fraction of occurrences')
ax[0].set_title("Histogram of Genes")
ax[0].grid()

ax[1].plot(c)
ax[1].set_xlabel('Index of a Gene')
ax[1].set_ylabel('Cumulative fraction')
ax[1].set_title("Cumulative distribution of Genes")
ax[1].grid()
plt.tight_layout()

unique_variations = train_df['Variation'].value_counts()
s = sum(unique_variations.values)
h = unique_variations.values/s
c = np.cumsum(h)

fig, ax = plt.subplots(1,2, figsize=(16,8))
ax[0].plot(h)
ax[0].set_xlabel('Index of a Variation')
ax[0].set_ylabel('Fraction of occurrences')
ax[0].set_title('Histogram of Variations')
ax[0].grid()

ax[1].plot(c)
ax[1].set_xlabel('Index of a Variation')
ax[1].set_ylabel('Cumulative fraction')
ax[1].set_title('Cumulative distribution of Variations')
ax[1].grid()
plt.tight_layout()

Feature Transformation

Gene Feature

# Response coding
target = 'Class' 
feature = 'Gene' 
laplace_alpha = 10

train_gene_feature_responseCoding, cv_gene_feature_responseCoding, test_gene_feature_responseCoding = get_response_coded_feature(train_df, cv_df, test_df, feature, target, laplace_alpha)
print("train_gene_feature_responseCoding : shape of feature:", train_gene_feature_responseCoding.shape)

# one hot encoding
gene_vectorizer = CountVectorizer()
train_gene_feature_onehotCoding = gene_vectorizer.fit_transform(train_df[feature])
test_gene_feature_onehotCoding = gene_vectorizer.transform(test_df[feature])
cv_gene_feature_onehotCoding = gene_vectorizer.transform(cv_df[feature])

print("train_gene_feature_onehotCoding : shape of feature:", train_gene_feature_onehotCoding.shape)
train_gene_feature_responseCoding : shape of feature: (2124, 9)
train_gene_feature_onehotCoding : shape of feature: (2124, 229)

Variation Feature

# Response coding
target = 'Class' 
feature = 'Variation' 
laplace_alpha = 10

train_variation_feature_responseCoding, cv_variation_feature_responseCoding, test_variation_feature_responseCoding = get_response_coded_feature(train_df, cv_df, test_df, feature, target, laplace_alpha)
print("train_variation_feature_responseCoding : shape of feature:", train_variation_feature_responseCoding.shape)

# one hot encoding
variation_vectorizer = CountVectorizer()
train_variation_feature_onehotCoding = variation_vectorizer.fit_transform(train_df['Variation'])
test_variation_feature_onehotCoding = variation_vectorizer.transform(test_df['Variation'])
cv_variation_feature_onehotCoding = variation_vectorizer.transform(cv_df['Variation'])

print("train_variation_feature_onehotCoding : shape of feature:", train_variation_feature_onehotCoding.shape)
train_variation_feature_responseCoding : shape of feature: (2124, 9)
train_variation_feature_onehotCoding : shape of feature: (2124, 1967)

Text Feature

feature = 'TEXT'
target = 'Class'

# One hot encoding
# building a CountVectorizer with all the words that occurred at least 3 times in the train data
text_vectorizer = CountVectorizer(min_df=3)
train_text_feature_onehotCoding = text_vectorizer.fit_transform(train_df[feature])
train_text_features = text_vectorizer.get_feature_names_out()

# .sum(axis=0).A1 sums over the rows and returns a (1 * number of features) vector of per-word counts
train_text_feature_counts = train_text_feature_onehotCoding.sum(axis=0).A1
text_feature_dict = dict(zip(train_text_features,train_text_feature_counts))

# Use the same vectorizer that was trained on train data
cv_text_feature_onehotCoding = text_vectorizer.transform(cv_df[feature])
test_text_feature_onehotCoding = text_vectorizer.transform(test_df[feature])

# Normalise
train_text_feature_onehotCoding = normalize(train_text_feature_onehotCoding, axis=0)
cv_text_feature_onehotCoding = normalize(cv_text_feature_onehotCoding, axis=0)
test_text_feature_onehotCoding = normalize(test_text_feature_onehotCoding, axis=0)

# Response coding
def words_value_count_in_text_df(df, feature):
    # count how often each word occurs across all the documents in df[feature]
    text_split = pd.DataFrame(df[feature].str.split().tolist(), index=df.index)
    dictionary = defaultdict(int, text_split.unstack().value_counts().to_dict())
    return dictionary

def get_text_responsecoding(df, class_dict_list, total_dict, feature, target):
    # for every row, compute one value per class: the geometric mean over the words
    # in the text of the Laplace-smoothed probability P(class | word)
    no_of_classes = len(df[target].unique())
    text_feature_responseCoding = np.zeros((df.shape[0], no_of_classes))
    for i in range(0, no_of_classes):
        row_index = 0
        for index, row in df.iterrows():
            sum_prob = 0
            for word in row[feature].split():
                # Laplace smoothing: +10 in the numerator, +90 (= 10 * 9 classes) in the denominator
                sum_prob += math.log((class_dict_list[i].get(word, 0) + 10) / (total_dict.get(word, 0) + 90))
            text_feature_responseCoding[row_index][i] = math.exp(sum_prob / len(row[feature].split()))
            row_index += 1
    return text_feature_responseCoding

class_dict_list = []
for i in range(1,10):
    class_text_df = train_df[train_df[target] == i]
    dictionary = words_value_count_in_text_df(class_text_df, feature)
    class_dict_list.append(dictionary)
total_dict = words_value_count_in_text_df(train_df, feature)

train_text_feature_responseCoding  = get_text_responsecoding(train_df, class_dict_list, total_dict, feature, target)
cv_text_feature_responseCoding  = get_text_responsecoding(cv_df, class_dict_list, total_dict, feature, target)
test_text_feature_responseCoding  = get_text_responsecoding(test_df, class_dict_list, total_dict, feature, target)

train_text_feature_responseCoding = (train_text_feature_responseCoding.T/train_text_feature_responseCoding.sum(axis=1)).T
cv_text_feature_responseCoding = (cv_text_feature_responseCoding.T/cv_text_feature_responseCoding.sum(axis=1)).T
test_text_feature_responseCoding = (test_text_feature_responseCoding.T/test_text_feature_responseCoding.sum(axis=1)).T

Transformed Feature EDA

print('train_gene_feature_responseCoding')
_ = pd.DataFrame(train_gene_feature_responseCoding).plot(bins=30, kind='hist', subplots=True, figsize=(16,16),layout=(3,3),  title='train_gene_feature_responseCoding', grid=True)
plt.tight_layout()
plt.show()

print('train_variation_feature_responseCoding')
_ = pd.DataFrame(train_variation_feature_responseCoding).plot(bins=30, kind='hist', subplots=True, figsize=(16,16),layout=(3,3), title='train_variation_feature_responseCoding', grid=True)
plt.tight_layout()
plt.show()

print('train_text_feature_responseCoding')
_ = pd.DataFrame(train_text_feature_responseCoding).plot(bins=30, kind='hist', subplots=True, figsize=(16,16),layout=(3,3), title='train_text_feature_responseCoding', grid=True)
plt.tight_layout()
plt.show()

print('train_text_feature_onehotCoding')
sorted_text_feature_dict = dict(sorted(text_feature_dict.items(), key=lambda x: x[1] , reverse=True))
sorted_text_occur = np.array(list(sorted_text_feature_dict.values()))
_ = pd.Series(Counter(sorted_text_occur)).sort_values(ascending=False)[:50].plot(kind='bar', figsize=(16,8), title='Number of words for a given frequency')
plt.tight_layout()
plt.show()
train_gene_feature_responseCoding

train_variation_feature_responseCoding

train_text_feature_responseCoding

train_text_feature_onehotCoding

Impact Analysis

Gene Feature (One Hot Encoded) using SGD Classifier + Calibration Model with Log Loss

alpha = [10 ** x for x in range(-5, 1)] # hyperparam for SGD classifier.

cv_log_error_array=[]
for i in alpha:
    clf = SGDClassifier(alpha=i, penalty='l2', loss='log', random_state=42)
    clf.fit(train_gene_feature_onehotCoding, y_train)
    sig_clf = CalibratedClassifierCV(clf, method="sigmoid")
    sig_clf.fit(train_gene_feature_onehotCoding, y_train)
    predict_y = sig_clf.predict_proba(cv_gene_feature_onehotCoding)
    cv_log_error_array.append(log_loss(y_cv, predict_y, labels=clf.classes_, eps=1e-15))
    print('For values of alpha = ', i, "The log loss is:",log_loss(y_cv, predict_y, labels=clf.classes_, eps=1e-15))

fig, ax = plt.subplots()
ax.plot(alpha, cv_log_error_array,c='g')
for i, txt in enumerate(np.round(cv_log_error_array,3)):
    ax.annotate((alpha[i],np.round(txt,3)), (alpha[i],cv_log_error_array[i]))
plt.grid()
plt.title("Cross Validation Error for each alpha")
plt.xlabel("Alpha i's")
plt.ylabel("Error measure")
plt.show()


best_alpha = np.argmin(cv_log_error_array)
clf = SGDClassifier(alpha=alpha[best_alpha], penalty='l2', loss='log', random_state=42)
clf.fit(train_gene_feature_onehotCoding, y_train)
sig_clf = CalibratedClassifierCV(clf, method="sigmoid")
sig_clf.fit(train_gene_feature_onehotCoding, y_train)

predict_y = sig_clf.predict_proba(train_gene_feature_onehotCoding)
print('For values of best alpha = ', alpha[best_alpha], "The train log loss is:",log_loss(y_train, predict_y, labels=clf.classes_, eps=1e-15))
predict_y = sig_clf.predict_proba(cv_gene_feature_onehotCoding)
print('For values of best alpha = ', alpha[best_alpha], "The cross validation log loss is:",log_loss(y_cv, predict_y, labels=clf.classes_, eps=1e-15))
predict_y = sig_clf.predict_proba(test_gene_feature_onehotCoding)
print('For values of best alpha = ', alpha[best_alpha], "The test log loss is:",log_loss(y_test, predict_y, labels=clf.classes_, eps=1e-15))
print('\n')

test_coverage=test_df[test_df['Gene'].isin(list(set(train_df['Gene'])))].shape[0]
cv_coverage=cv_df[cv_df['Gene'].isin(list(set(train_df['Gene'])))].shape[0]

print('In test data', test_coverage, 'out of',test_df.shape[0], ":",round((test_coverage/test_df.shape[0])*100,2), '%')
print('In cross validation data',cv_coverage, 'out of ',cv_df.shape[0],":" ,round((cv_coverage/cv_df.shape[0])*100,2), '%')
For values of alpha =  1e-05 The log loss is: 1.225157977779788
For values of alpha =  0.0001 The log loss is: 1.2186194232146514
For values of alpha =  0.001 The log loss is: 1.2698771600985577
For values of alpha =  0.01 The log loss is: 1.3892311253795766
For values of alpha =  0.1 The log loss is: 1.4786048680300583
For values of alpha =  1 The log loss is: 1.517888924555487

For values of best alpha =  0.0001 The train log loss is: 0.9626738410932817
For values of best alpha =  0.0001 The cross validation log loss is: 1.2186194232146514
For values of best alpha =  0.0001 The test log loss is: 1.213990889587606


In test data 645 out of 665 : 96.99 %
In cross validation data 511 out of  532 : 96.05 %

Since the CV and test errors are similar to the train error, the Gene feature is stable.

Variation Feature (One Hot Encoded) using SGD Classifier + Calibration Model with Log Loss

alpha = [10 ** x for x in range(-5, 1)]

cv_log_error_array=[]
for i in alpha:
    clf = SGDClassifier(alpha=i, penalty='l2', loss='log', random_state=42)
    clf.fit(train_variation_feature_onehotCoding, y_train)

    sig_clf = CalibratedClassifierCV(clf, method="sigmoid")
    sig_clf.fit(train_variation_feature_onehotCoding, y_train)
    predict_y = sig_clf.predict_proba(cv_variation_feature_onehotCoding)

    cv_log_error_array.append(log_loss(y_cv, predict_y, labels=clf.classes_, eps=1e-15))
    print('For values of alpha = ', i, "The log loss is:",log_loss(y_cv, predict_y, labels=clf.classes_, eps=1e-15))

fig, ax = plt.subplots()
ax.plot(alpha, cv_log_error_array,c='g')
for i, txt in enumerate(np.round(cv_log_error_array,3)):
    ax.annotate((alpha[i],np.round(txt,3)), (alpha[i],cv_log_error_array[i]))
plt.grid()
plt.title("Cross Validation Error for each alpha")
plt.xlabel("Alpha i's")
plt.ylabel("Error measure")
plt.show()


best_alpha = np.argmin(cv_log_error_array)
clf = SGDClassifier(alpha=alpha[best_alpha], penalty='l2', loss='log', random_state=42)
clf.fit(train_variation_feature_onehotCoding, y_train)
sig_clf = CalibratedClassifierCV(clf, method="sigmoid")
sig_clf.fit(train_variation_feature_onehotCoding, y_train)

predict_y = sig_clf.predict_proba(train_variation_feature_onehotCoding)
print('For values of best alpha = ', alpha[best_alpha], "The train log loss is:",log_loss(y_train, predict_y, labels=clf.classes_, eps=1e-15))
predict_y = sig_clf.predict_proba(cv_variation_feature_onehotCoding)
print('For values of best alpha = ', alpha[best_alpha], "The cross validation log loss is:",log_loss(y_cv, predict_y, labels=clf.classes_, eps=1e-15))
predict_y = sig_clf.predict_proba(test_variation_feature_onehotCoding)
print('For values of best alpha = ', alpha[best_alpha], "The test log loss is:",log_loss(y_test, predict_y, labels=clf.classes_, eps=1e-15))

print('\n')
test_coverage=test_df[test_df['Variation'].isin(list(set(train_df['Variation'])))].shape[0]
cv_coverage=cv_df[cv_df['Variation'].isin(list(set(train_df['Variation'])))].shape[0]
print('In test data',test_coverage, 'out of',test_df.shape[0], ":",(test_coverage/test_df.shape[0])*100)
print('In cross validation data',cv_coverage, 'out of ',cv_df.shape[0],":" ,(cv_coverage/cv_df.shape[0])*100)
For values of alpha =  1e-05 The log loss is: 1.6880748869472166
For values of alpha =  0.0001 The log loss is: 1.6848972129901907
For values of alpha =  0.001 The log loss is: 1.6873507994191796
For values of alpha =  0.01 The log loss is: 1.694110632040056
For values of alpha =  0.1 The log loss is: 1.7001843816504443
For values of alpha =  1 The log loss is: 1.7015408337552536

For values of best alpha =  0.0001 The train log loss is: 0.6457223722823235
For values of best alpha =  0.0001 The cross validation log loss is: 1.6848972129901907
For values of best alpha =  0.0001 The test log loss is: 1.707051708439472


In test data 68 out of 665 : 10.225563909774436
In cross validation data 66 out of  532 : 12.406015037593985

Since the CV and test errors are not very similar to the train error, the Variation feature is unstable.

Text Feature (One Hot Encoded) using SGD Classifier + Calibration Model with Log Loss

alpha = [10 ** x for x in range(-5, 1)]

cv_log_error_array=[]
for i in alpha:
    clf = SGDClassifier(alpha=i, penalty='l2', loss='log', random_state=42)
    clf.fit(train_text_feature_onehotCoding, y_train)

    sig_clf = CalibratedClassifierCV(clf, method="sigmoid")
    sig_clf.fit(train_text_feature_onehotCoding, y_train)
    predict_y = sig_clf.predict_proba(cv_text_feature_onehotCoding)
    cv_log_error_array.append(log_loss(y_cv, predict_y, labels=clf.classes_, eps=1e-15))
    print('For values of alpha = ', i, "The log loss is:",log_loss(y_cv, predict_y, labels=clf.classes_, eps=1e-15))

fig, ax = plt.subplots()
ax.plot(alpha, cv_log_error_array,c='g')
for i, txt in enumerate(np.round(cv_log_error_array,3)):
    ax.annotate((alpha[i],np.round(txt,3)), (alpha[i],cv_log_error_array[i]))
plt.grid()
plt.title("Cross Validation Error for each alpha")
plt.xlabel("Alpha i's")
plt.ylabel("Error measure")
plt.show()


best_alpha = np.argmin(cv_log_error_array)
clf = SGDClassifier(alpha=alpha[best_alpha], penalty='l2', loss='log', random_state=42)
clf.fit(train_text_feature_onehotCoding, y_train)
sig_clf = CalibratedClassifierCV(clf, method="sigmoid")
sig_clf.fit(train_text_feature_onehotCoding, y_train)

predict_y = sig_clf.predict_proba(train_text_feature_onehotCoding)
print('For values of best alpha = ', alpha[best_alpha], "The train log loss is:",log_loss(y_train, predict_y, labels=clf.classes_, eps=1e-15))
predict_y = sig_clf.predict_proba(cv_text_feature_onehotCoding)
print('For values of best alpha = ', alpha[best_alpha], "The cross validation log loss is:",log_loss(y_cv, predict_y, labels=clf.classes_, eps=1e-15))
predict_y = sig_clf.predict_proba(test_text_feature_onehotCoding)
print('For values of best alpha = ', alpha[best_alpha], "The test log loss is:",log_loss(y_test, predict_y, labels=clf.classes_, eps=1e-15))
For values of alpha =  1e-05 The log loss is: 1.3417099148783505
For values of alpha =  0.0001 The log loss is: 1.2077311513227988
For values of alpha =  0.001 The log loss is: 1.2220108736944972
For values of alpha =  0.01 The log loss is: 1.3291070684522948
For values of alpha =  0.1 The log loss is: 1.4905994716927748
For values of alpha =  1 The log loss is: 1.6978774030031254

For values of best alpha =  0.0001 The train log loss is: 0.6505120004239873
For values of best alpha =  0.0001 The cross validation log loss is: 1.2077311513227988
For values of best alpha =  0.0001 The test log loss is: 1.2064042366178185

The Text feature is stable across all the data sets (train, cross validation, test).

def get_intersec_text(df):
    df_text_vec = CountVectorizer(min_df=3)
    df_text_fea = df_text_vec.fit_transform(df['TEXT'])
    df_text_features = df_text_vec.get_feature_names_out()

    df_text_fea_counts = df_text_fea.sum(axis=0).A1
    df_text_fea_dict = dict(zip(list(df_text_features),df_text_fea_counts))
    len1 = len(set(df_text_features))
    len2 = len(set(train_text_features) & set(df_text_features))
    return len1,len2
len1,len2 = get_intersec_text(cv_df)
print(np.round((len2/len1)*100, 3), "% of the words in the cross validation data appeared in the train data")
len1,len2 = get_intersec_text(test_df)
print(np.round((len2/len1)*100, 3), "% of the words in the test data appeared in the train data")
97.986 % of the words in the cross validation data appeared in the train data
96.803 % of the words in the test data appeared in the train data

Machine Learning Models

def predict_and_plot_confusion_matrix(train_x, train_y, test_x, test_y, clf):
    clf.fit(train_x, train_y)
    sig_clf = CalibratedClassifierCV(clf, method="sigmoid")
    sig_clf.fit(train_x, train_y)
    pred_y = sig_clf.predict(test_x)

    # for calculating log_loss we provide the array of probabilities for each class
    print("Log loss :", log_loss(test_y, sig_clf.predict_proba(test_x)))
    # calculating the fraction of data points that are misclassified
    print("Fraction of mis-classified points :", np.count_nonzero((pred_y - test_y))/test_y.shape[0])
    plot_confusion_matrix(test_y, pred_y)

def report_log_loss(train_x, train_y, test_x, test_y, clf):
    clf.fit(train_x, train_y)
    sig_clf = CalibratedClassifierCV(clf, method="sigmoid")
    sig_clf.fit(train_x, train_y)
    sig_clf_probs = sig_clf.predict_proba(test_x)
    return log_loss(test_y, sig_clf_probs, eps=1e-15)

# this function is used just for naive bayes
# for the given indices, we print the names of the features
# and check whether each feature is present in the test point's text or not
def get_impfeature_names(indices, text, gene, var, no_features):
    gene_count_vec = CountVectorizer()
    var_count_vec = CountVectorizer()
    text_count_vec = CountVectorizer(min_df=3)

    gene_vec = gene_count_vec.fit(train_df['Gene'])
    var_vec  = var_count_vec.fit(train_df['Variation'])
    text_vec = text_count_vec.fit(train_df['TEXT'])

    fea1_len = len(gene_vec.get_feature_names_out())
    fea2_len = len(var_count_vec.get_feature_names_out())

    word_present = 0
    for i,v in enumerate(indices):
        if (v < fea1_len):
            word = gene_vec.get_feature_names_out()[v]
            yes_no = True if word == gene else False
            if yes_no:
                word_present += 1
                print(i, "Gene feature [{}] present in test data point [{}]".format(word,yes_no))
        elif (v < fea1_len+fea2_len):
            word = var_vec.get_feature_names_out()[v-(fea1_len)]
            yes_no = True if word == var else False
            if yes_no:
                word_present += 1
                print(i, "variation feature [{}] present in test data point [{}]".format(word,yes_no))
        else:
            word = text_vec.get_feature_names_out()[v-(fea1_len+fea2_len)]
            yes_no = True if word in text.split() else False
            if yes_no:
                word_present += 1
                print(i, "Text feature [{}] present in test data point [{}]".format(word,yes_no))

    print("Out of the top ",no_features," features ", word_present, "are present in query point")

Stacking all features

train_gene_var_onehotCoding = hstack((train_gene_feature_onehotCoding,train_variation_feature_onehotCoding))
test_gene_var_onehotCoding = hstack((test_gene_feature_onehotCoding,test_variation_feature_onehotCoding))
cv_gene_var_onehotCoding = hstack((cv_gene_feature_onehotCoding,cv_variation_feature_onehotCoding))

train_x_onehotCoding = hstack((train_gene_var_onehotCoding, train_text_feature_onehotCoding)).tocsr()
train_y = np.array(list(train_df['Class']))

test_x_onehotCoding = hstack((test_gene_var_onehotCoding, test_text_feature_onehotCoding)).tocsr()
test_y = np.array(list(test_df['Class']))

cv_x_onehotCoding = hstack((cv_gene_var_onehotCoding, cv_text_feature_onehotCoding)).tocsr()
cv_y = np.array(list(cv_df['Class']))

print("One hot encoding features :")
print("(number of data points * number of features) in train data = ", train_x_onehotCoding.shape)
print("(number of data points * number of features) in test data = ", test_x_onehotCoding.shape)
print("(number of data points * number of features) in cross validation data =", cv_x_onehotCoding.shape)
One hot encoding features :
(number of data points * number of features) in train data =  (2124, 55944)
(number of data points * number of features) in test data =  (665, 55944)
(number of data points * number of features) in cross validation data = (532, 55944)

train_gene_var_responseCoding = np.hstack((train_gene_feature_responseCoding,train_variation_feature_responseCoding))
test_gene_var_responseCoding = np.hstack((test_gene_feature_responseCoding,test_variation_feature_responseCoding))
cv_gene_var_responseCoding = np.hstack((cv_gene_feature_responseCoding,cv_variation_feature_responseCoding))

train_x_responseCoding = np.hstack((train_gene_var_responseCoding, train_text_feature_responseCoding))
test_x_responseCoding = np.hstack((test_gene_var_responseCoding, test_text_feature_responseCoding))
cv_x_responseCoding = np.hstack((cv_gene_var_responseCoding, cv_text_feature_responseCoding))

print(" Response encoding features :")
print("(number of data points * number of features) in train data = ", train_x_responseCoding.shape)
print("(number of data points * number of features) in test data = ", test_x_responseCoding.shape)
print("(number of data points * number of features) in cross validation data =", cv_x_responseCoding.shape)
 Response encoding features :
(number of data points * number of features) in train data =  (2124, 27)
(number of data points * number of features) in test data =  (665, 27)
(number of data points * number of features) in cross validation data = (532, 27)

Naive Bayes

alpha = [0.00001, 0.0001, 0.001, 0.1, 1, 10, 100,1000]
cv_log_error_array = []
for i in alpha:
    print("for alpha =", i)
    clf = MultinomialNB(alpha=i)
    clf.fit(train_x_onehotCoding, train_y)
    sig_clf = CalibratedClassifierCV(clf, method="sigmoid")
    sig_clf.fit(train_x_onehotCoding, train_y)
    sig_clf_probs = sig_clf.predict_proba(cv_x_onehotCoding)
    cv_log_error_array.append(log_loss(cv_y, sig_clf_probs, labels=clf.classes_, eps=1e-15))
    # to avoid rounding errors while multiplying probabilities we use log-probability estimates
    print("Log Loss :",log_loss(cv_y, sig_clf_probs)) 

fig, ax = plt.subplots()
ax.plot(np.log10(alpha), cv_log_error_array,c='g')
for i, txt in enumerate(np.round(cv_log_error_array,3)):
    ax.annotate((alpha[i],str(txt)), (np.log10(alpha[i]),cv_log_error_array[i]))
plt.grid()
plt.xticks(np.log10(alpha))
plt.title("Cross Validation Error for each alpha")
plt.xlabel("Alpha i's")
plt.ylabel("Error measure")
plt.show()


best_alpha = np.argmin(cv_log_error_array)
clf = MultinomialNB(alpha=alpha[best_alpha])
clf.fit(train_x_onehotCoding, train_y)
sig_clf = CalibratedClassifierCV(clf, method="sigmoid")
sig_clf.fit(train_x_onehotCoding, train_y)


predict_y = sig_clf.predict_proba(train_x_onehotCoding)
print('For values of best alpha = ', alpha[best_alpha], "The train log loss is:",log_loss(y_train, predict_y, labels=clf.classes_, eps=1e-15))
predict_y = sig_clf.predict_proba(cv_x_onehotCoding)
print('For values of best alpha = ', alpha[best_alpha], "The cross validation log loss is:",log_loss(y_cv, predict_y, labels=clf.classes_, eps=1e-15))
predict_y = sig_clf.predict_proba(test_x_onehotCoding)
print('For values of best alpha = ', alpha[best_alpha], "The test log loss is:",log_loss(y_test, predict_y, labels=clf.classes_, eps=1e-15))
for alpha = 1e-05
Log Loss : 1.2954975305870766
for alpha = 0.0001
Log Loss : 1.2869374246985383
for alpha = 0.001
Log Loss : 1.2842631205424986
for alpha = 0.1
Log Loss : 1.2815875568364257
for alpha = 1
Log Loss : 1.2995403679730142
for alpha = 10
Log Loss : 1.3512446897455843
for alpha = 100
Log Loss : 1.3322494833632648
for alpha = 1000
Log Loss : 1.2809048617321397

For values of best alpha =  1000 The train log loss is: 1.0285530527044575
For values of best alpha =  1000 The cross validation log loss is: 1.2809048617321397
For values of best alpha =  1000 The test log loss is: 1.2867053550420369

Model Testing with the Best Hyperparameters

clf = MultinomialNB(alpha=alpha[best_alpha])
clf.fit(train_x_onehotCoding, train_y)
sig_clf = CalibratedClassifierCV(clf, method="sigmoid")
sig_clf.fit(train_x_onehotCoding, train_y)
sig_clf_probs = sig_clf.predict_proba(cv_x_onehotCoding)
# to avoid rounding errors while multiplying probabilities we use log-probability estimates
print("Log Loss :",log_loss(cv_y, sig_clf_probs))
print("Fraction of misclassified points :", np.count_nonzero((sig_clf.predict(cv_x_onehotCoding)- cv_y))/cv_y.shape[0])
plot_confusion_matrix(cv_y, sig_clf.predict(cv_x_onehotCoding.toarray()))
Log Loss : 1.2809048617321397
Fraction of misclassified points : 0.40789473684210525

Feature Importance for a Correctly Classified Point

test_point_index = 1
no_feature = 100
predicted_cls = sig_clf.predict(test_x_onehotCoding[test_point_index])
print("Predicted Class :", predicted_cls[0])
print("Predicted Class Probabilities:", np.round(sig_clf.predict_proba(test_x_onehotCoding[test_point_index]),4))
print("Actual Class :", test_y[test_point_index])
indices=np.argsort(-1*clf.feature_log_prob_)[predicted_cls-1][:,:no_feature]
print("-"*50)
get_impfeature_names(indices[0], test_df['TEXT'].iloc[test_point_index],test_df['Gene'].iloc[test_point_index],test_df['Variation'].iloc[test_point_index], no_feature)
Predicted Class : 1
Predicted Class Probabilities: [[0.4597 0.1588 0.0104 0.0636 0.0274 0.0241 0.2489 0.0059 0.0011]]
Actual Class : 1
--------------------------------------------------
12 Text feature [dna] present in test data point [True]
13 Text feature [type] present in test data point [True]
14 Text feature [function] present in test data point [True]
15 Text feature [protein] present in test data point [True]
17 Text feature [one] present in test data point [True]
18 Text feature [two] present in test data point [True]
19 Text feature [containing] present in test data point [True]
20 Text feature [region] present in test data point [True]
21 Text feature [wild] present in test data point [True]
22 Text feature [loss] present in test data point [True]
23 Text feature [conserved] present in test data point [True]
24 Text feature [amino] present in test data point [True]
25 Text feature [sequence] present in test data point [True]
26 Text feature [possible] present in test data point [True]
28 Text feature [effect] present in test data point [True]
29 Text feature [reduced] present in test data point [True]
30 Text feature [present] present in test data point [True]
31 Text feature [therefore] present in test data point [True]
33 Text feature [analysis] present in test data point [True]
35 Text feature [four] present in test data point [True]
37 Text feature [least] present in test data point [True]
38 Text feature [specific] present in test data point [True]
41 Text feature [structure] present in test data point [True]
43 Text feature [results] present in test data point [True]
44 Text feature [large] present in test data point [True]
46 Text feature [critical] present in test data point [True]
48 Text feature [used] present in test data point [True]
50 Text feature [three] present in test data point [True]
53 Text feature [determined] present in test data point [True]
54 Text feature [define] present in test data point [True]
55 Text feature [corresponding] present in test data point [True]
56 Text feature [control] present in test data point [True]
57 Text feature [identified] present in test data point [True]
59 Text feature [using] present in test data point [True]
60 Text feature [possibility] present in test data point [True]
61 Text feature [data] present in test data point [True]
62 Text feature [table] present in test data point [True]
63 Text feature [six] present in test data point [True]
65 Text feature [additional] present in test data point [True]
66 Text feature [shown] present in test data point [True]
67 Text feature [also] present in test data point [True]
68 Text feature [result] present in test data point [True]
69 Text feature [likely] present in test data point [True]
71 Text feature [discussion] present in test data point [True]
74 Text feature [different] present in test data point [True]
75 Text feature [whether] present in test data point [True]
77 Text feature [reveal] present in test data point [True]
78 Text feature [highly] present in test data point [True]
79 Text feature [remains] present in test data point [True]
80 Text feature [within] present in test data point [True]
81 Text feature [proteins] present in test data point [True]
82 Text feature [either] present in test data point [True]
83 Text feature [human] present in test data point [True]
84 Text feature [performed] present in test data point [True]
85 Text feature [significantly] present in test data point [True]
86 Text feature [expected] present in test data point [True]
87 Text feature [whereas] present in test data point [True]
88 Text feature [five] present in test data point [True]
89 Text feature [nonsense] present in test data point [True]
91 Text feature [gene] present in test data point [True]
93 Text feature [may] present in test data point [True]
94 Text feature [25] present in test data point [True]
97 Text feature [genetic] present in test data point [True]
98 Text feature [30] present in test data point [True]
99 Text feature [panel] present in test data point [True]
Out of the top  100  features  65 are present in query point

test_df['TEXT'].iloc[test_point_index]
'introduction platinum based chemotherapy standard care patients muscle invasive metastatic urothelial carcinoma 20 years 1 3 neoadjuvant cisplatin based chemotherapy leads 14 25 relative risk reduction death muscle invasive urothelial carcinoma ct2 t4an0m0 refs 3 5 pathologic downstaging complete response pt0 carcinoma situ ptis cystectomy occurs 26 38 patients treated neoadjuvant chemotherapy compared 12 3 15 patients undergoing cystectomy alone 3 4 6 5 year survival pt0 ptis patients 85 neoadjuvant chemotherapy 3 compared 43 patients persistent muscle invasive disease pt2 ref 7 therefore benefit neoadjuvant chemotherapy seems dramatic patients found pathologic complete responses time surgical resection however inability predict patients derive clinical benefit limited use toxic approach urologic community 8 10 cisplatin causes accumulation dna cross links interfere dna replication gene transcription eventually promotes cell death repair cisplatin induced dna damage occurs primarily dna repair pathways nucleotide excision repair ner ref 11 homologous recombination includes brca1 brca2 ref 12 ner pathway involves multiple genes including ercc1 5 cdk7 ddb1 2 xpa xpc 13 germline alterations ner genes result multiple recessive inherited disorders including xeroderma pigmentosum xp ref 13 deficiency ner patients xp significantly increased risk developing skin cancers malignancies 14 germline expression based changes several ner genes suggested modulate clinical response cisplatin based chemotherapy 15 16 however prospective studies confirmed observations 17 18 ercc2 ner helicase loss function correlates cisplatin sensitivity preclinical models 19 whereas ercc2 overexpression leads cisplatin resistance 20 data suggest tumors loss ner function may exhibit increased cisplatin sensitivity recent data identified somatic ercc2 mutations 7 12 urothelial carcinomas 21 22 hypothesized somatic mutations ner pathway may correlate response cisplatin based neoadjuvant chemotherapy patients urothelial carcinoma test generally define genomic correlates chemotherapy response performed whole exome sequencing wes urothelial carcinoma tumors patients extreme responses neoadjuvant cisplatin based combination chemotherapy results somatic genetic alterations muscle invasive urothelial carcinoma sequenced pretreatment tumor germline dna 50 patients treated neoadjuvant cisplatin based chemotherapy muscle invasive urothelial carcinoma 25 responders residual invasive disease pt0 ptis pathologic examination following cystectomy 25 nonresponders residual muscle invasive pt2 disease fig 1a methods although multiple chemotherapeutic regimens used common agent among patients cisplatin table 1 supplementary table s1 significant differences clinical characteristics identified responders nonresponders baseline p 0 05 mann whitney test figure 1 download figureopen new tabdownload powerpoint figure 1 study design mutation rates aggregate significant somatic mutations patients muscle invasive urothelial carcinoma cancer split cases controls based pathologic response cisplatin based neoadjuvant chemotherapy turbt transurethral resection bladder tumor nine cases could complete sequencing due technical reasons failed sequencing elevated contamination b data arranged column represents tumor row represents gene center panel divided responders left black nonresponders right yellow mutation rates responders elevated compared nonresponders top alteration landscape center aggregate cohort n 50 patients demonstrates set statistically 
significant genes altered urothelial carcinoma tp53 rb1 kdm6a arid1a negative log q values significance level mutated genes shown genes q 0 1 right ercc2 mutation status also shown genes although ercc2 significantly mutated across combined cohort additional data allelic fraction ranges case bottom mutation rates top mutational frequency left also summarized view inlineview popup table 1 patient characteristics mean target coverage wes 121 tumors 130 paired germline samples supplementary table s1 median mutation rate 9 7 mutations per megabase mutations mb responders 4 4 mutations mb nonresponders p 0 0003 mann whitney test fig 1b supplementary fig s1a s1b supplementary table s1 raising possibility reduced dna repair fidelity among cisplatin responders observed somatic mutations short insertion deletions reported supplementary table s2 statistical assessment 23 base mutations short insertion deletions across responders nonresponders nominated four significantly altered genes previously implicated urothelial carcinoma 21 22 24 tp53 rb1 kdm6a arid1a fig 1b supplementary tables s3 s4 addition nine nonsynonymous somatic mutations observed ercc2 fig 1b supplementary fig s2 supplementary table s2 although ercc2 reach cohort wide statistical significance known role dna repair report recurrently mutated bladder cancer 21 22 raised possibility ercc2 mutations might associate cisplatin response somatic ercc2 mutations cisplatin based chemotherapy responders performed enrichment analysis identify genes selectively mutated responders compared nonresponders supplementary methods among 3 277 genes least one possibly damaging somatic alteration supplementary methods ercc2 gene significantly enriched responder cohort fig 2a supplementary tables s5 s6 indeed ercc2 nonsynonymous somatic mutations occurred cisplatin sensitive tumors p 0 001 fisher exact test ercc2 remained significantly enriched responders following false discovery analysis performed genes mutation frequency afforded adequate power q 0 007 benjamini hochberg fig 2b moreover enrichment ercc2 mutations responder group significant adjusted differences overall mutation rate responders nonresponders p 0 04 binomial test toward end median background mutation rate ercc2 mutant tumors 15 5 mutations mb significantly elevated compared ercc2 wild type wt tumors 5 1 mutations mb p 0 01 mann whitney test supplementary fig s1b consistent possible dna repair defect prior reports 22 figure 2 download figureopen new tabdownload powerpoint figure 2 three tests examining selective enrichment ercc2 mutations cisplatin responder tumors plot mutsigcv gene level significance log10 mutsigcv p value responder enrichment significance log10 fisher exact test p value size point proportional number responder patients harbor alterations gene genes responder enrichment p value 0 01 shown red others gray dashed line denotes p value 0 01 ercc2 reaches statistical significance responder cohort p 0 001 fisher exact test b among genes sufficient number alterations cohort comparisons n 9 ercc2 somatic mutations occur exclusively cisplatin responders significant accounting elevated mutation rate responders compared nonresponders p 0 05 compared unselected tcga guo et al urothelial carcinoma cohorts c shows ercc2 somatic mutations significantly enriched responder cohort p 0 01 somatic ercc2 mutation frequency responder cohort also compared two unselected bladder cancer populations 130 cases cancer genome atlas tcga ref 21 99 cases chinese patient cohort ref 22 fig 2c compared 
unselected populations ercc2 mutations significantly enriched cisplatin responder cohort 36 cases p 0 001 binomial test fig 2c ercc2 mutation status seem prognostic impact overall survival tcga cohort p 0 77 log rank supplementary fig s3 determine relative abundance somatic ercc2 mutations tumor types tcga data 19 tumor types n 4 429 queried 25 somatic ercc2 mutations observed low frequencies 4 11 tumor types fig 3a b figure 3 download figureopen new tabdownload powerpoint figure 3 ercc2 mutation mapping distribution across tumor types stick plot ercc2 showing locations somatic mutations responders compared ercc2 mutations observed two separate unselected bladder cancer exome cohorts ercc2 mutations cluster within near conserved helicase motifs b somatic ercc2 mutation frequency multiple tumor types tcga scc squamous cell carcinoma c structure archaebacterial ercc2 pdb code 3crv mutations identified responder cohort mapped equivalent position locations shown context canonical germline ercc2 mutations responsible xp xp cockayne syndrome cs trichothiodystrophy ttd identified somatic ercc2 mutations occurred highly conserved amino acid positions within immediately adjacent helicase domains fig 3a c supplementary fig s4 similarly germline ercc2 mutations patients xp complementation group xp combined cockayne syndrome xp cs two disorders characterized ner function clustered within helicase domains fig 3c conversely mutations causing trichothiodystrophy disease resulting alteration ercc2 role transcription distributed throughout protein 26 ercc2 mutations confer increased cisplatin sensitivity vitro observations raised possibility identified mutations disrupt ercc2 function interfere ner test hypothesis first five identified ercc2 mutants stably expressed immortalized ercc2 deficient cell line derived xp patient cisplatin sensitivity profile cell line measured supplementary methods fig 4a expression wt ercc2 rescued cisplatin sensitivity ercc2 deficient cell line whereas none ercc2 mutants rescued cisplatin sensitivity fig 4b ic50 ercc2 wt complemented cell line significantly higher ercc2 deficient parent cell line p 0 0001 anova whereas ic50 ercc2 mutant complemented cell line significantly different parent ercc2 deficient parent cell line fig 4c similarly ic50 ercc2 wt cell line significantly higher ercc2 deficient mutant cell lines p 0 0001 anova figure 4 download figureopen new tabdownload powerpoint figure 4 ercc2 mutants fail rescue cisplatin sensitivity ercc2 deficient cells immunoblot ercc2 expression cell lines created transfection ercc2 deficient parent cell line gm08207 coriell institute plx304 addgene encoding gfp negative control wt ercc2 mutant ercc2 negative control ercc2 deficient cell line lane 1 expresses endogenous levels inactive ercc2 parent cell genome whereas wt lane 2 mutant lanes 3 7 ercc2 cell lines show increased levels ercc2 expressed transfected gene actin shown loading control b cisplatin sensitivity profiles cell lines expressing wt mutant ercc2 expression wt ercc2 ercc2 deficient background rescues cisplatin sensitivity whereas expression ercc2 mutants fails rescue cisplatin sensitivity ic50 calculated survival data cell line values shown c difference ic50 parent ercc2 deficient cell line cell line expressing wt ercc2 statistically significant difference wt ercc2 cell line mutant ercc2 cell lines p 0 0001 anova difference ercc2 deficient cell line mutant cell lines statistically significant ner pathway repairs dna lesions cisplatin adducts also tested effect 
identified ercc2 mutations ner mediated repair uv induced dna damage wt mutant ercc2 complemented cell lines exposed increasing doses uv irradiation clonogenic survival measured fig 5a c whereas ercc2 wt complemented cell line rescued uv sensitivity uv sensitivities ercc2 mutant complemented cell lines significantly different ercc2 deficient parent cell line taken together experiments suggest observed ercc2 mutations result loss normal ner capacity figure 5 download figureopen new tabdownload powerpoint figure 5 ercc2 mutants fail rescue uv sensitivity ercc2 deficient cells representative colony formation assay ercc2 deficient cell line top well ercc2 deficient line transfected wt ercc2 middle one ercc2 mutants d609g bottom following increased doses uv irradiation b clonogenic survival data negative control wt ercc2 mutant ercc2 wt ercc2 rescues uv sensitivity ercc2 deficient cell line whereas mutant ercc2s fail rescue uv sensitivity c uv ic50 values cell lines difference ercc2 deficient cell line wt ercc2 cell line significant p 0 0001 anova whereas difference ercc2 deficient cell line ercc2 mutant cell lines statistically significant ns overall mutation rate higher ercc2 mutated tumors hypothesized ercc2 mutations may broadly contributing genomic instability thus measured rates chromosomal aberrations wt mutant ercc2 complemented cell lines cisplatin treatment absence cisplatin background rates chromosomal aberrations slightly lower ercc2 wt complemented cell line ercc2 deficient ercc2 mutant complemented cell lines difference statistically significant fig 6a however following cisplatin exposure significantly fewer chromosomal aberrations observed ercc2 wt complemented cell line ercc2 deficient parent cell line p 0 03 anova whereas expression ercc2 mutants resulted rescue chromosomal stability p 0 5 fig 6d data suggest responder associated ercc2 mutations may contribute overall genomic instability tumors figure 6 download figureopen new tabdownload powerpoint figure 6 ercc2 mutants fail rescue genomic instability following cisplatin exposure representative mitotic spreads ercc2 deficient cell line ercc2 deficient cell line transfected wt ercc2 b one ercc2 mutants v242f c following cisplatin exposure chromosomal aberration data ercc2 deficient wt ercc2 mutant ercc2 cell lines rates chromosomal aberrations following cisplatin exposure significantly lower wt ercc2 cell line ercc2 deficient line cell lines expressing mutant ercc2 p 0 03 anova dna repair gene alterations finally sought determine whether dna repair genes might undergo recurrent mutations cisplatin sensitive tumors significantly recurrent mutations observed ner homologous repair genes responders compared nonresponders however two responder tumors ercc2 mutations somatic nonsense truncating mutations detected homologous recombination dna repair genes brca1 brca2 supplementary table s2 nonsynonymous brca1 brca2 mutations nonresponders supplementary table s2 results consistent known sensitivity brca1 2 mutant tumors e g brca1 2 mutant breast ovarian cancers platinum containing regimens 27 whereas somatic ercc2 mutations dna repair gene mutations enriched responders singleton missense mutations uncertain significance observed dna damage response genes responders nonresponders unknown functional relevance discussion bladder cancer clinical benefit neoadjuvant chemotherapy apparent pathologic downstaging pt0 ptis achieved surgical resection following cisplatin based chemotherapy however lack predictive biomarker clinical benefit 
neoadjuvant cisplatin based chemotherapy limited use approach clinical community due toxicity concerns using extreme phenotype analysis identified association somatic ercc2 mutations pathologic complete response neoadjuvant cisplatin based chemotherapy muscle invasive urothelial carcinoma although ercc2 mutations occur approximately 12 unselected cases 36 cisplatin based chemotherapy responders cohort harbored somatic ercc2 nonsynonymous mutations moreover ercc2 mutant tumors responded neoadjuvant chemotherapy ner pathway highly conserved dna repair system identifies repairs bulky dna adducts arising genotoxic agents cisplatin ner helicase ercc2 unwinds duplex dna near damage site coordinated action two conserved helicase domains ercc2 mutations identified study occurred conserved positions within adjacent helicase domains identified mutants failed complement cisplatin uv sensitivity ercc2 deficient cell line together data suggest mutations result loss normal ercc2 function leading increased tumor cell sensitivity dna damaging agents cisplatin interestingly seven 78 ercc2 mutant cases ercc2 mutation allelic fraction 0 5 supplementary table s2 suggesting wt ercc2 remains present one allele therefore cisplatin sensitivity phenotype may result haploinsufficient dominant negative effect heterozygous ercc2 mutation rather result biallelic inactivating mutations traditional two hit tumor suppressor model driving force heterozygous mutation ercc2 unknown however prevalence ercc2 mutations study cohorts tcga suggests loss normal ercc2 function may provide selective advantage tumors partial loss dna repair fidelity could aid tumor growth decreasing repair associated delays cell cycle progression addition decreased ner capacity may result higher rates error prone repair large scale genomic changes drive tumor growth despite providing potential growth advantage mutation one ercc2 allele may render tumor cells susceptible dna damaging agent cisplatin inadequate levels wt ercc2 present support ner e haploinsufficiency alternatively mutated version ercc2 may bind efficiently repair damaged dna thereby preventing repair alternative dna repair pathway leading dominant negative phenotype described mutants yeast ercc2 homolog rad3 28 studies necessary explore effects ercc2 loss tumor growth mechanism identified ercc2 mutations confer changes tumor ner capacity one possible explanation findings observed study somatic ercc2 mutation associated good prognosis smaller tumors however supported survival data tcga bladder cancer cohort excluded patients received neoadjuvant chemotherapy difference overall survival observed basis somatic ercc2 mutation status supplementary fig s3 extreme response cohort reported patients generally noted obvious disease left behind initial transurethral resection based operative reports imaging making findings unlikely related complete transurethral resection although data yet used justify avoiding cisplatin based treatment ercc2 wt patients findings raise possibility somatic ercc2 mutation status may provide genetic means select patients likely benefit cisplatin based chemotherapy directing patients toward alternative therapeutic approaches note study focused specifically somatic mutations exclusively tumor germline single nucleotide polymorphisms ercc2 genes thus approach distinct genome wide association studies examined germline ercc1 ercc2 polymorphisms mixed impact cisplatin sensitivity 29 broadly findings require independent clinical validation prospective trials establish 
clinical predictive power somatic ercc2 mutation status cisplatin response possible nonresponding urothelial tumors harbor somatic ercc2 mutations larger cohorts observed examination post chemotherapy resistant tumor would critical understanding whether tumor heterogeneity played role resistance patients analyzed extreme phenotype analysis treated combination cisplatin based chemotherapy regimens containing non cisplatin agents however cisplatin agent common regimens half patients bladder cancer candidates cisplatin based chemotherapy due preexisting comorbidities less toxic carboplatin based neoadjuvant therapies may warrant study non cisplatin eligible patients ercc2 mutant tumors date exceptional response genomic studies informed genomic mechanisms response targeted therapies response everolimus multiple disease contexts 30 31 however published studies limited individual case reports due rarity events targeted therapies study represents new approach studying extraordinary responses commonly used cytotoxic chemotherapies using case control design may applied therapeutic settings significant minority patients achieve exceptional response e g neoadjuvant chemotherapies clinical settings toward end majority responder cases cohort recurrent genomic determinant cisplatin response possible alterations dna repair genes readily detectable wes e g epigenetic expression based may mediate cisplatin sensitivity cases conclusion work provides new insights relationship somatic genetic alterations clinical response cisplatin based chemotherapy urothelial carcinoma validated results may inform therapeutic decision making novel therapeutic development clinical trial designs urothelial carcinoma possibly ercc2 mutated tumors finally results show somatic genomic alterations may reveal mechanistic underpinnings antitumor response conventional cytotoxic chemotherapy methods patients samples patients muscle invasive locally advanced urothelial carcinoma extreme responses chemotherapy defined residual invasive carcinoma cystectomy presence persistent muscle invasive extravesical disease cystectomy available prechemotherapy tumor tissue enrolled institutional review board irb approved tissue acquisition dna sequencing protocols identified dana farber protocols 02 021 11 334 memorial sloan kettering cancer center protocols 89 076 09 025 patients provided written informed consent genomic testing used study specimens evaluated genitourinary pathologists j barletta signoretti dfci cohort h al ahmadie mskcc cohort identify tumor bearing areas dna extraction minimum percentage neoplastic cellularity regions tumor tissue 60 study specimens frozen formalin fixed paraffin embedded ffpe tissue sections identified dana farber cancer institute dfci memorial sloan kettering cancer center mskcc germline dna extracted either peripheral blood mononuclear cells histologically normal nonurothelial tissue information source germline dna available supplementary table s1 wes statistical analysis dna extraction exome sequencing slides cut ffpe frozen tissue blocks examined board certified pathologist select high density cancer foci ensure high purity cancer dna biopsy cores taken corresponding tissue block dna extraction dna extracted using qiagen qiaamp dna ffpe tissue kit quantitation reagent invitrogen dna stored 20 c whole exome capture libraries constructed 100 ng dna tumor normal tissue sample shearing end repair phosphorylation ligation barcoded sequencing adaptors ligated dna size selected lengths 200 350 bp subjected exonic 
hybrid capture using sureselect v2 exome bait agilent technologies sample multiplexed sequenced using illumina hiseq technology mean target exome coverage 121 tumors 130 germline samples four cases complete exome sequencing process due sequencing process failure bam files generated study deposited database genotypes phenotypes dbgap phs000771 v1 p1 sequence data processing exome sequence data processing analysis performed using pipelines broad institute bam file aligned hg19 human genome build produced using illumina sequencing reads tumor normal sample picard pipeline bam files uploaded firehose infrastructure 32 managed intermediate analysis files executed analysis pipelines bam files uploaded dbgap phs000771 v1 p1 sequencing quality control sequencing data incorporated quality control modules firehose compare tumor normal genotypes ensure concordance samples cross contamination samples individuals sequenced flow cell monitored contest algorithm 33 samples 4 contamination excluded n 4 alteration identification annotation mutect algorithm 34 applied identify somatic single nucleotide variants targeted exons indelocator applied identify small insertions deletions 35 gene level coverage determined depthofcoverage genome analysis tool kit 36 alterations annotated using oncotator 37 power calculations coverage determined using mutect coverage file requires minimum 14 coverage tumor samples median allelic fractions less 0 05 insufficient dna orthogonal validation fluidigm access array excluded n 1 alteration significance mutsigcv 23 applied aggregate cohort 50 cases determine statistically altered genes cohort alterations nominated significant genes mutsigcv manually reviewed integrated genomics viewer igv refs 38 39 alterations invalid based igv review result misalignment artifacts viewable igv subsequently excluded final results resulted exclusion tgfbr1 depdc4 final result supplementary table s3 selective gene enrichment analysis somatic mutations short insertion deletions aggregated cohort split responders nonresponders alterations coverage tumor site 30 allelic fraction 0 1 considered analyses missense nonsense splice site mutations along short insertion deletions assigned damaging score range 0 1 following previously reported methods 40 missense mutations scored using polyphen2 score 41 amino acid substitution missense mutations without available polyphen2 scores due mapping errors dinucleotide status listed unavailable supplementary table s5 excluded nonsense mutations splice site mutations short insertion deletions assigned damaging score 1 damaging scores listed supplementary table s5 alterations damaging score 0 5 tabulated occurrence responders nonresponders altered gene would counted per patient fisher exact test performed compare cohorts derive p value gene minimum six alterations required observe p value 0 01 genes 6 alterations cohort thereby representing 10 cohort highest potential clinical significance considered multiple hypothesis testing results summarized supplementary table s6 fig 2a ercc2 mutation frequency responders compared unselected tcga 21 guo colleagues 22 cohorts binomial test results analysis made available fig 2c comparison ercc2 mutation distribution responders nonresponders adjusted elevated mutation rate responders performed binomial test conditional observing nine mutations using estimated ratio mutation rates responders nonresponders expected frequency ercc2 mutations responders null hypothesis e g ercc2 binomial test x 9 n 9 p mutation rateresponders 
mutation ratenonresponders alternative greater statistical calculations performed using r statistical package mutation validation orthogonal validation selected mutations short insertion deletions presented article including ercc2 performed using fluidigm access array 50 cases 35 sufficient dna generate sufficient read depth analysis total 85 candidate targets submitted fluidigm single plex pcr primer assay design resulted design 65 assays covering 85 targets assay amplicons ranged 163 199 bp size average 183 bp available samples run access array system fluidigm using three 48 48 access array ifc chips following manufacturer recommendations using 4 primer amplicon tagging protocol access array fluidigm p n 100 3770 rev c1 exception access array ifc chips loaded harvested using bravo automated liquid handling platform agilent technologies following manufacturer recommendations resulting amplicons containing sample specific barcodes illumina adapter sequencers pooled sequenced miseq sequencer illumina two runs 150 base paired end reads v2 sequencing chemistry using custom fluidigm sequencing primers following manufacturer recommendations fluidigm sites manually reviewed igv determine presence absence nonreference reads details validation results ercc2 additional variants given supplementary table s2 variants inadequate sample validation insufficient sequencing reads validation data interpret manually igv listed unavailable cloning cell lines site directed pcr mutagenesis bp attb attp recombination method 42 used generate wt mutant ercc2 open reading frames orf mutant pcr products generated fragments overlap region desired mutation fragments introduced pdonr vector bp reaction bp reaction mixture transformed escherichia coli recombined generate pentr vector pentr vector used perform lr attl attr reaction create expression plasmid expression plasmids harboring wt ercc2 gfp negative control mutant ercc2 constructs expanded e coli top10 cells invitrogen purified using anion exchange kit qiagen lentiviruses propagated 293t cells cotransfection expression plasmid plasmids encoding viral packaging envelope proteins unless otherwise noted human cell lines cultured dmem invitrogen supplemented 10 fbs sigma 1 l glutamine grown 37 c 5 co2 293t cell supernatants containing virus collected 48 hours filtered twice 0 45 syringe filter millipore added directly growing cultures sv40 transformed pseudodiploid ercc2 deficient fibroblast cell line derived xp patient genetic complementation group gm08207 coriell institute cell line compound heterozygote harboring ercc2r683w ercc2 del 36 61 mutations polybrene sigma added final concentration 8 g ml increase efficiency infection stable integrates selected incubation 5 days media containing 10 g ml blastocidin physical biologic containment procedures recombinant dna followed institutional protocols accordance nih guidelines research involving recombinant dna molecules cell line obtained march 2013 corriell cell repositories camden nj authenticated microsatellite genotyping cisplatin sensitivity assays cells transferred 96 well plates density 500 cells per well six hours later cisplatin sigma serially diluted media added wells 96 hours celltiterglo reagent promega added wells plates scanned using luminescence microplate reader biotek survival cisplatin concentration plotted percentage survival cisplatin free media data point graph represents average three independent measurements error bars represent standard deviation ic50 concentrations calculated using four 
parameter sigmoidal model plots generated using prism graphpad one way anova test used compare ic50 negative control cell line ic50 wt mutant ercc2 cell lines compare ic50 wt line ic50 negative control mutant cell lines uv clonogenic survival assays cells seeded 6 well plates nunc density 1 500 cells per well following day cells washed pbs exposed increasing uv doses using uv b irradiator stratagene medium replaced cells allowed grow 9 days day 10 cells fixed using 1 5 acetic acid methanol solution 20 minutes room temperature cells stained 45 minutes using 1 crystal violet methanol solution plates rinsed vigorously water allowed dry colonies manually counted number colonies present uv dose plotted ratio number colonies present mock irradiated wells data point represents average three independent measurements error bars represent standard deviation chromosomal breakage analysis approximately 1 106 cells seeded per 10 cm dish 24 hours 400 nmol l cisplatin added cells allowed grow additional 48 hours cells exposed colcemid 2 hours harvested using 0 075 mol l kcl fixed 3 1 methanol acetic acid slides stained wright stain 25 50 metaphases analyzed chromosomal aberrations chromosome chromatid breaks rings translocations deletions fragments double minute chromosomes tri quadraradials di tricentrics premature chromatid separation immunoblots frozen cell pellets thawed resuspended ripa buffer 50 mmol l tris ph 7 3 150 mmol l nacl 1 mmol l edta 1 triton x 100 0 5 na deoxycholate 0 1 sds supplemented complete protease inhibitor roche navo4 naf cell suspensions centrifuged total protein concentration supernatant determined colorimetry bio rad samples boiled loading buffer bio rad electrophoresed 3 8 gradient tris acetate gel life technologies resolved proteins transferred polyvinylidene difluoride pvdf membrane millipore 90 v 2 hours 4 c membrane blocked 1 hour blocking solution 5 milk tris buffered saline incubated primary antibody blocking solution 4 c overnight mouse ercc2 abcam rabbit actin cell signaling technology following day membrane rinsed incubated 1 hour peroxidase conjugated secondary antibody blocking solution anti mouse anti rabbit cell signaling technology rinsed enhanced chemiluminescent substrate solution perkinelmer added signal detected film exposure ge healthcare '
no_feature
100
test_df['Gene'].iloc[test_point_index]
'ERCC2'
test_df['Variation'].iloc[test_point_index]
'E606G'
clf.feature_log_prob_.shape
(9, 55944)
indices=np.argsort(-1*abs(clf.feature_log_prob_))[predicted_cls-1][:,:no_feature]
indices[0]
array([    0, 36561, 36568, 36571, 36582, 36583, 36592, 16154, 16153,
       36558, 36594, 36596, 36597, 36601, 36604, 36607, 36609, 36637,
       36638, 36595, 36557, 36556, 36552, 36530, 36531, 36532, 36533,
       36534, 36535, 36536, 16180, 16179, 36537, 36538, 16176, 36539,
       36540, 36541, 36542, 36547, 36551, 16169, 36639, 36643, 36647,
       16139, 16111, 36718, 36719, 36721, 16106, 36722, 16104, 36725,
       36726, 16101, 36727, 36729, 16097, 36734, 36735, 36737, 16092,
       36738, 36740, 36715, 36529, 36713, 36704, 16138, 36657, 36660,
       16133, 36666, 36671, 36675, 16129, 36676, 16127, 36678, 36687,
       36688, 36693, 36694, 36695, 36698, 16117, 36702, 36709, 36742,
       36528, 36527, 36445, 36447, 16256, 36448, 36449, 36452, 36455,
       36456])
# this function is used only for the Naive Bayes model
# for the given feature indices, we print the feature names
# and check whether each feature is present in the test point's text
def get_impfeature_names(indices, text, gene, var, no_features):
    gene_count_vec = CountVectorizer()
    var_count_vec = CountVectorizer()
    text_count_vec = CountVectorizer(min_df=3)

    gene_vec = gene_count_vec.fit(train_df['Gene'])
    var_vec  = var_count_vec.fit(train_df['Variation'])
    text_vec = text_count_vec.fit(train_df['TEXT'])

    fea1_len = len(gene_vec.get_feature_names_out())
    fea2_len = len(var_count_vec.get_feature_names_out())

    word_present = 0
    for i,v in enumerate(indices):
        if (v < fea1_len):
            word = gene_vec.get_feature_names_out()[v]
            yes_no = True if word == gene else False
            if yes_no:
                word_present += 1
                print(i, "Gene feature [{}] present in test data point [{}]".format(word,yes_no))
        elif (v < fea1_len+fea2_len):
            word = var_vec.get_feature_names_out()[v-(fea1_len)]
            yes_no = True if word == var else False
            if yes_no:
                word_present += 1
                print(i, "variation feature [{}] present in test data point [{}]".format(word,yes_no))
        else:
            word = text_vec.get_feature_names_out()[v-(fea1_len+fea2_len)]
            yes_no = True if word in text.split() else False
            if yes_no:
                word_present += 1
                print(i, "Text feature [{}] present in test data point [{}]".format(word,yes_no))

    print("Out of the top ",no_features," features ", word_present, "are present in query point")
for i in range(10):
    test_point_index = i
    no_feature = 100
    predicted_cls = sig_clf.predict(test_x_onehotCoding[test_point_index])
    print("Predicted Class :", predicted_cls[0])
    print("Predicted Class Probabilities:", np.round(sig_clf.predict_proba(test_x_onehotCoding[test_point_index]),4))
    print("Actual Class :", test_y[test_point_index])
    indices=np.argsort(-1*abs(clf.feature_log_prob_))[predicted_cls-1][:,:no_feature]
    print("-"*50)
    get_impfeature_names(indices[0], test_df['TEXT'].iloc[test_point_index],test_df['Gene'].iloc[test_point_index].lower(),test_df['Variation'].iloc[test_point_index], no_feature)
Predicted Class : 6
Predicted Class Probabilities: [4.090e-02 1.070e-02 5.400e-03 5.280e-02 7.330e-02 7.635e-01 4.870e-02
  4.300e-03 3.000e-04]
Actual Class : 6
--------------------------------------------------
Out of the top  100  features  0 are present in query point
Predicted Class : 1
Predicted Class Probabilities: [0.4597 0.1588 0.0104 0.0636 0.0274 0.0241 0.2489 0.0059 0.0011]
Actual Class : 1
--------------------------------------------------
Out of the top  100  features  0 are present in query point
Predicted Class : 7
Predicted Class Probabilities: [2.480e-02 8.020e-02 5.000e-04 8.200e-03 3.200e-03 1.100e-03 8.782e-01
  3.800e-03 0.000e+00]
Actual Class : 2
--------------------------------------------------
Out of the top  100  features  0 are present in query point
Predicted Class : 7
Predicted Class Probabilities: [0.0345 0.3528 0.0149 0.0416 0.0342 0.0314 0.4819 0.0067 0.0019]
Actual Class : 2
--------------------------------------------------
Out of the top  100  features  0 are present in query point
Predicted Class : 7
Predicted Class Probabilities: [0.0698 0.0858 0.0264 0.1075 0.0507 0.0511 0.5949 0.0094 0.0044]
Actual Class : 7
--------------------------------------------------
Out of the top  100  features  0 are present in query point
Predicted Class : 4
Predicted Class Probabilities: [0.0705 0.0504 0.0164 0.6325 0.0329 0.03   0.1588 0.0063 0.0021]
Actual Class : 7
--------------------------------------------------
Out of the top  100  features  0 are present in query point
Predicted Class : 7
Predicted Class Probabilities: [0.0673 0.0258 0.0018 0.1074 0.0071 0.0048 0.7812 0.0047 0.    ]
Actual Class : 7
--------------------------------------------------
59 Text feature [bub1] present in test data point [True]
Out of the top  100  features  1 are present in query point
Predicted Class : 7
Predicted Class Probabilities: [0.0984 0.2299 0.0324 0.1407 0.06   0.0666 0.356  0.0094 0.0067]
Actual Class : 2
--------------------------------------------------
Out of the top  100  features  0 are present in query point
Predicted Class : 7
Predicted Class Probabilities: [0.0556 0.1443 0.054  0.0837 0.0528 0.0493 0.5469 0.009  0.0045]
Actual Class : 7
--------------------------------------------------
Out of the top  100  features  0 are present in query point
Predicted Class : 7
Predicted Class Probabilities: [5.400e-03 5.200e-03 3.200e-03 6.800e-03 8.700e-03 4.900e-03 9.607e-01
  5.000e-03 1.000e-04]
Actual Class : 7
--------------------------------------------------
Out of the top  100  features  0 are present in query point
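
For reference, here is a tiny, self-contained sketch of the same interpretation idea on a made-up corpus (the documents, labels, and vocabulary below are invented for illustration and are not part of the data used above): for a fitted MultinomialNB, feature_log_prob_[c] holds log P(feature | class c), so sorting a class's row in descending order lists the words the model associates most strongly with that class.

# Toy illustration only: mapping MultinomialNB feature_log_prob_ back to vocabulary positions.
# The corpus and labels are made up; the mechanics mirror the lookup done in get_impfeature_names above.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

docs   = ["egfr kinase activation activating mutation",
          "brca1 loss of function truncating mutation",
          "egfr activating mutation kinase domain"]
labels = [7, 1, 7]

vec = CountVectorizer()
X = vec.fit_transform(docs)
nb = MultinomialNB(alpha=1.0).fit(X, labels)

vocab = vec.get_feature_names_out()
for row, cls in zip(nb.feature_log_prob_, nb.classes_):
    top = np.argsort(row)[::-1][:3]            # indices of highest log P(word | class)
    print("class", cls, "->", [vocab[i] for i in top])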

Feature Importance, Incorrectly classified point

test_point_index = 100
no_feature = 100
predicted_cls = sig_clf.predict(test_x_onehotCoding[test_point_index])
print("Predicted Class :", predicted_cls[0])
print("Predicted Class Probabilities:", np.round(sig_clf.predict_proba(test_x_onehotCoding[test_point_index]),4))
print("Actual Class :", test_y[test_point_index])
indices = np.argsort(-1*abs(clf.feature_log_prob_))[predicted_cls-1][:,:no_feature]
print("-"*50)
get_impfeature_names(indices[0], test_df['TEXT'].iloc[test_point_index],test_df['Gene'].iloc[test_point_index],test_df['Variation'].iloc[test_point_index], no_feature)
Predicted Class : 7
Predicted Class Probabilities: [3.800e-02 7.340e-02 3.700e-03 4.260e-02 1.190e-02 9.300e-03 8.155e-01
  5.500e-03 1.000e-04]
Actual Class : 7
--------------------------------------------------
Out of the top  100  features  0 are present in query point

Nearest Neighbour Classification

Hyper parameter tuning

alpha = [5, 11, 15, 21, 31, 41, 51, 99]
cv_log_error_array = []
for i in alpha:
    print("for alpha =", i)
    clf = KNeighborsClassifier(n_neighbors=i)
    clf.fit(train_x_responseCoding, train_y)
    sig_clf = CalibratedClassifierCV(clf, method="sigmoid")
    sig_clf.fit(train_x_responseCoding, train_y)
    sig_clf_probs = sig_clf.predict_proba(cv_x_responseCoding)
    cv_log_error_array.append(log_loss(cv_y, sig_clf_probs, labels=clf.classes_, eps=1e-15))
    # to avoid rounding errors while multiplying probabilities we use log-probability estimates
    print("Log Loss :",log_loss(cv_y, sig_clf_probs)) 

fig, ax = plt.subplots()
ax.plot(alpha, cv_log_error_array,c='g')
for i, txt in enumerate(np.round(cv_log_error_array,3)):
    ax.annotate((alpha[i],str(txt)), (alpha[i],cv_log_error_array[i]))
plt.grid()
plt.title("Cross Validation Error for each alpha")
plt.xlabel("Alpha i's")
plt.ylabel("Error measure")
plt.show()


best_alpha = np.argmin(cv_log_error_array)
clf = KNeighborsClassifier(n_neighbors=alpha[best_alpha])
clf.fit(train_x_responseCoding, train_y)
sig_clf = CalibratedClassifierCV(clf, method="sigmoid")
sig_clf.fit(train_x_responseCoding, train_y)

predict_y = sig_clf.predict_proba(train_x_responseCoding)
print('For values of best alpha = ', alpha[best_alpha], "The train log loss is:",log_loss(y_train, predict_y, labels=clf.classes_, eps=1e-15))
predict_y = sig_clf.predict_proba(cv_x_responseCoding)
print('For values of best alpha = ', alpha[best_alpha], "The cross validation log loss is:",log_loss(y_cv, predict_y, labels=clf.classes_, eps=1e-15))
predict_y = sig_clf.predict_proba(test_x_responseCoding)
print('For values of best alpha = ', alpha[best_alpha], "The test log loss is:",log_loss(y_test, predict_y, labels=clf.classes_, eps=1e-15))
for alpha = 5
Log Loss : 1.1036779125472609
for alpha = 11
Log Loss : 1.0683198189507492
for alpha = 15
Log Loss : 1.0580822704500141
for alpha = 21
Log Loss : 1.0563308294497766
for alpha = 31
Log Loss : 1.0560537205374698
for alpha = 41
Log Loss : 1.0826667857571959
for alpha = 51
Log Loss : 1.0901258431039602
for alpha = 99
Log Loss : 1.1253028140655437

For values of best alpha =  31 The train log loss is: 0.7840473307464905
For values of best alpha =  31 The cross validation log loss is: 1.0560537205374698
For values of best alpha =  31 The test log loss is: 1.1103171665646643

Testing the model with best hyper parameters

clf = KNeighborsClassifier(n_neighbors=alpha[best_alpha])
predict_and_plot_confusion_matrix(train_x_responseCoding, train_y, cv_x_responseCoding, cv_y, clf)
Log loss : 1.0560537205374698
Number of mis-classified points : 0.3684210526315789

Sample Query Point-1

clf = KNeighborsClassifier(n_neighbors=alpha[best_alpha])
clf.fit(train_x_responseCoding, train_y)
sig_clf = CalibratedClassifierCV(clf, method="sigmoid")
sig_clf.fit(train_x_responseCoding, train_y)

test_point_index = 1
predicted_cls = sig_clf.predict(test_x_responseCoding[0].reshape(1,-1))
print("Predicted Class :", predicted_cls[0])
print("Actual Class :", test_y[test_point_index])
neighbors = clf.kneighbors(test_x_responseCoding[test_point_index].reshape(1, -1), alpha[best_alpha])
print("The ",alpha[best_alpha]," nearest neighbours of the test points belongs to classes",train_y[neighbors[1][0]])
print("Fequency of nearest points :",Counter(train_y[neighbors[1][0]]))
Predicted Class : 6
Actual Class : 1
The  31  nearest neighbours of the test point belong to classes [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 4 1 1 1 1 1 1 1 5 1 1 1 1]
Frequency of nearest points : Counter({1: 28, 2: 1, 4: 1, 5: 1})

Sample Query Point-2

clf = KNeighborsClassifier(n_neighbors=alpha[best_alpha])
clf.fit(train_x_responseCoding, train_y)
sig_clf = CalibratedClassifierCV(clf, method="sigmoid")
sig_clf.fit(train_x_responseCoding, train_y)

test_point_index = 100

predicted_cls = sig_clf.predict(test_x_responseCoding[test_point_index].reshape(1,-1))
print("Predicted Class :", predicted_cls[0])
print("Actual Class :", test_y[test_point_index])
neighbors = clf.kneighbors(test_x_responseCoding[test_point_index].reshape(1, -1), alpha[best_alpha])
print("the k value for knn is",alpha[best_alpha],"and the nearest neighbours of the test points belongs to classes",train_y[neighbors[1][0]])
print("Fequency of nearest points :",Counter(train_y[neighbors[1][0]]))
Predicted Class : 7
Actual Class : 7
The k value for KNN is 31 and the nearest neighbours of the test point belong to classes [7 2 7 7 2 7 2 2 2 7 6 7 7 2 7 7 7 7 6 2 7 7 2 7 7 7 2 6 7 2 2]
Frequency of nearest points : Counter({7: 17, 2: 11, 6: 3})

Logistic Regression

Hyper parameter tuning

alpha = [10 ** x for x in range(-6, 3)]
cv_log_error_array = []
for i in alpha:
    print("for alpha =", i)
    clf = SGDClassifier(class_weight='balanced', alpha=i, penalty='l2', loss='log', random_state=42)
    clf.fit(train_x_onehotCoding, train_y)
    sig_clf = CalibratedClassifierCV(clf, method="sigmoid")
    sig_clf.fit(train_x_onehotCoding, train_y)
    sig_clf_probs = sig_clf.predict_proba(cv_x_onehotCoding)
    cv_log_error_array.append(log_loss(cv_y, sig_clf_probs, labels=clf.classes_, eps=1e-15))
    # to avoid rounding errors while multiplying probabilities we use log-probability estimates
    print("Log Loss :",log_loss(cv_y, sig_clf_probs)) 

fig, ax = plt.subplots()
ax.plot(alpha, cv_log_error_array,c='g')
for i, txt in enumerate(np.round(cv_log_error_array,3)):
    ax.annotate((alpha[i],str(txt)), (alpha[i],cv_log_error_array[i]))
plt.grid()
plt.title("Cross Validation Error for each alpha")
plt.xlabel("Alpha i's")
plt.ylabel("Error measure")
plt.show()


best_alpha = np.argmin(cv_log_error_array)
clf = SGDClassifier(class_weight='balanced', alpha=alpha[best_alpha], penalty='l2', loss='log', random_state=42)
clf.fit(train_x_onehotCoding, train_y)
sig_clf = CalibratedClassifierCV(clf, method="sigmoid")
sig_clf.fit(train_x_onehotCoding, train_y)

predict_y = sig_clf.predict_proba(train_x_onehotCoding)
print('For values of best alpha = ', alpha[best_alpha], "The train log loss is:",log_loss(y_train, predict_y, labels=clf.classes_, eps=1e-15))
predict_y = sig_clf.predict_proba(cv_x_onehotCoding)
print('For values of best alpha = ', alpha[best_alpha], "The cross validation log loss is:",log_loss(y_cv, predict_y, labels=clf.classes_, eps=1e-15))
predict_y = sig_clf.predict_proba(test_x_onehotCoding)
print('For values of best alpha = ', alpha[best_alpha], "The test log loss is:",log_loss(y_test, predict_y, labels=clf.classes_, eps=1e-15))
for alpha = 1e-06
Log Loss : 1.3549611984629366
for alpha = 1e-05
Log Loss : 1.3243830764474769
for alpha = 0.0001
Log Loss : 1.1484035371750703
for alpha = 0.001
Log Loss : 1.1469421660609054
for alpha = 0.01
Log Loss : 1.2277813159484565
for alpha = 0.1
Log Loss : 1.443192254191518
for alpha = 1
Log Loss : 1.6513624798101794
for alpha = 10
Log Loss : 1.6801064048265375
for alpha = 100
Log Loss : 1.6832847660500232

For values of best alpha =  0.001 The train log loss is: 0.5051957904999469
For values of best alpha =  0.001 The cross validation log loss is: 1.1469421660609054
For values of best alpha =  0.001 The test log loss is: 1.1285955963127317

Testing the model with best hyper parameters

clf = SGDClassifier(class_weight='balanced', alpha=alpha[best_alpha], penalty='l2', loss='log', random_state=42)
predict_and_plot_confusion_matrix(train_x_onehotCoding, train_y, cv_x_onehotCoding, cv_y, clf)
Log loss : 1.1469421660609054
Number of mis-classified points : 0.35902255639097747

Feature Importance Correctly Classified point

def get_imp_feature_names(text, indices, removed_ind = []):
    word_present = 0
    tabulate_list = []
    increasing_order_ind = 0
    for i in indices:
        if i < train_gene_feature_onehotCoding.shape[1]:
            tabulate_list.append([increasing_order_ind, "Gene", "Yes"])
        elif i < 18:
            tabulate_list.append([increasing_order_ind, "Variation", "Yes"])
        if ((i > 17) & (i not in removed_ind)):
            word = train_text_features[i]
            yes_no = True if word in text.split() else False
            if yes_no:
                word_present += 1
            tabulate_list.append([increasing_order_ind, train_text_features[i], yes_no])
        increasing_order_ind += 1
    print(word_present, "most important features are present in our query point")
    print("-"*50)
    print("The features that are most important for the ", predicted_cls[0], " class:")
    print(tabulate(tabulate_list, headers=["Index", 'Feature name', 'Present or Not']))
from tabulate import tabulate
clf = SGDClassifier(class_weight='balanced', alpha=alpha[best_alpha], penalty='l2', loss='log', random_state=42)
clf.fit(train_x_onehotCoding,train_y)
test_point_index = 1
no_feature = 500
predicted_cls = sig_clf.predict(test_x_onehotCoding[test_point_index])
print("Predicted Class :", predicted_cls[0])
print("Predicted Class Probabilities:", np.round(sig_clf.predict_proba(test_x_onehotCoding[test_point_index]),4))
print("Actual Class :", test_y[test_point_index])
indices = np.argsort(-1*abs(clf.coef_))[predicted_cls-1][:,:no_feature]
print("-"*50)
get_impfeature_names(indices[0], test_df['TEXT'].iloc[test_point_index],test_df['Gene'].iloc[test_point_index],test_df['Variation'].iloc[test_point_index], no_feature)
Predicted Class : 1
Predicted Class Probabilities: [0.8104 0.1318 0.0037 0.0126 0.0098 0.0058 0.0193 0.0046 0.002 ]
Actual Class : 1
--------------------------------------------------
252 Text feature [aggregated] present in test data point [True]
292 Text feature [processing] present in test data point [True]
342 Text feature [expressing] present in test data point [True]
367 Text feature [archaebacterial] present in test data point [True]
400 Text feature [contest] present in test data point [True]
412 Text feature [premature] present in test data point [True]
Out of the top  500  features  6 are present in query point

Feature Importance Incorrectly Classified point

test_point_index = 100
no_feature = 500
predicted_cls = sig_clf.predict(test_x_onehotCoding[test_point_index])
print("Predicted Class :", predicted_cls[0])
print("Predicted Class Probabilities:", np.round(sig_clf.predict_proba(test_x_onehotCoding[test_point_index]),4))
print("Actual Class :", test_y[test_point_index])
indices = np.argsort(-1*abs(clf.coef_))[predicted_cls-1][:,:no_feature]
print("-"*50)
get_impfeature_names(indices[0], test_df['TEXT'].iloc[test_point_index],test_df['Gene'].iloc[test_point_index],test_df['Variation'].iloc[test_point_index], no_feature)
Predicted Class : 7
Predicted Class Probabilities: [0.0168 0.1361 0.0015 0.0186 0.0058 0.0051 0.8105 0.0043 0.0013]
Actual Class : 7
--------------------------------------------------
87 Text feature [constitutive] present in test data point [True]
119 Text feature [tx] present in test data point [True]
123 Text feature [thyroid] present in test data point [True]
133 Text feature [constitutively] present in test data point [True]
135 Text feature [3t3] present in test data point [True]
159 Text feature [expressing] present in test data point [True]
193 Text feature [ligand] present in test data point [True]
207 Text feature [function] present in test data point [True]
213 Text feature [egf] present in test data point [True]
221 Text feature [activator] present in test data point [True]
238 Text feature [activating] present in test data point [True]
272 Text feature [receptors] present in test data point [True]
276 Text feature [extracellular] present in test data point [True]
315 Text feature [activation] present in test data point [True]
320 Text feature [loss] present in test data point [True]
322 Text feature [beas] present in test data point [True]
377 Text feature [y1068] present in test data point [True]
392 Text feature [stably] present in test data point [True]
452 Text feature [putative] present in test data point [True]
465 Text feature [purified] present in test data point [True]
475 Text feature [predicted] present in test data point [True]
Out of the top  500  features  21 are present in query point

Linear Support Vector Machines

Hyper parameter tuning

alpha = [10 ** x for x in range(-5, 3)]
cv_log_error_array = []
for i in alpha:
    print("for C =", i)
#     clf = SVC(C=i,kernel='linear',probability=True, class_weight='balanced')
    clf = SGDClassifier( class_weight='balanced', alpha=i, penalty='l2', loss='hinge', random_state=42)
    clf.fit(train_x_onehotCoding, train_y)
    sig_clf = CalibratedClassifierCV(clf, method="sigmoid")
    sig_clf.fit(train_x_onehotCoding, train_y)
    sig_clf_probs = sig_clf.predict_proba(cv_x_onehotCoding)
    cv_log_error_array.append(log_loss(cv_y, sig_clf_probs, labels=clf.classes_, eps=1e-15))
    print("Log Loss :",log_loss(cv_y, sig_clf_probs)) 

fig, ax = plt.subplots()
ax.plot(alpha, cv_log_error_array,c='g')
for i, txt in enumerate(np.round(cv_log_error_array,3)):
    ax.annotate((alpha[i],str(txt)), (alpha[i],cv_log_error_array[i]))
plt.grid()
plt.title("Cross Validation Error for each alpha")
plt.xlabel("Alpha i's")
plt.ylabel("Error measure")
plt.show()


best_alpha = np.argmin(cv_log_error_array)
# clf = SVC(C=i,kernel='linear',probability=True, class_weight='balanced')
clf = SGDClassifier(class_weight='balanced', alpha=alpha[best_alpha], penalty='l2', loss='hinge', random_state=42)
clf.fit(train_x_onehotCoding, train_y)
sig_clf = CalibratedClassifierCV(clf, method="sigmoid")
sig_clf.fit(train_x_onehotCoding, train_y)

predict_y = sig_clf.predict_proba(train_x_onehotCoding)
print('For values of best alpha = ', alpha[best_alpha], "The train log loss is:",log_loss(y_train, predict_y, labels=clf.classes_, eps=1e-15))
predict_y = sig_clf.predict_proba(cv_x_onehotCoding)
print('For values of best alpha = ', alpha[best_alpha], "The cross validation log loss is:",log_loss(y_cv, predict_y, labels=clf.classes_, eps=1e-15))
predict_y = sig_clf.predict_proba(test_x_onehotCoding)
print('For values of best alpha = ', alpha[best_alpha], "The test log loss is:",log_loss(y_test, predict_y, labels=clf.classes_, eps=1e-15))
for C = 1e-05
Log Loss : 1.3253802575992406
for C = 0.0001
Log Loss : 1.2873766645967677
for C = 0.001
Log Loss : 1.1596908320059625
for C = 0.01
Log Loss : 1.1847648563983895
for C = 0.1
Log Loss : 1.3879701310719252
for C = 1
Log Loss : 1.6675922322714856
for C = 10
Log Loss : 1.6838340386799557
for C = 100
Log Loss : 1.6838340836749632

For values of best alpha =  0.001 The train log loss is: 0.5220013914091015
For values of best alpha =  0.001 The cross validation log loss is: 1.1596908320059625
For values of best alpha =  0.001 The test log loss is: 1.168790814376124

Testing model with best hyper parameters

# clf = SVC(C=alpha[best_alpha],kernel='linear',probability=True, class_weight='balanced')
clf = SGDClassifier(alpha=alpha[best_alpha], penalty='l2', loss='hinge', random_state=42,class_weight='balanced')
predict_and_plot_confusion_matrix(train_x_onehotCoding, train_y,cv_x_onehotCoding,cv_y, clf)
Log loss : 1.1596908320059625
Number of mis-classified points : 0.3684210526315789

Feature Importance Correctly classified point

clf = SGDClassifier(alpha=alpha[best_alpha], penalty='l2', loss='hinge', random_state=42)
clf.fit(train_x_onehotCoding,train_y)
test_point_index = 1
# test_point_index = 100
no_feature = 500
predicted_cls = sig_clf.predict(test_x_onehotCoding[test_point_index])
print("Predicted Class :", predicted_cls[0])
print("Predicted Class Probabilities:", np.round(sig_clf.predict_proba(test_x_onehotCoding[test_point_index]),4))
print("Actual Class :", test_y[test_point_index])
indices = np.argsort(-1*abs(clf.coef_))[predicted_cls-1][:,:no_feature]
print("-"*50)
get_impfeature_names(indices[0], test_df['TEXT'].iloc[test_point_index],test_df['Gene'].iloc[test_point_index],test_df['Variation'].iloc[test_point_index], no_feature)
Predicted Class : 1
Predicted Class Probabilities: [0.7576 0.0853 0.009  0.0327 0.023  0.0188 0.0649 0.0035 0.0051]
Actual Class : 1
--------------------------------------------------
414 Text feature [aggregated] present in test data point [True]
Out of the top  500  features  1 are present in query point

Feature Importance Incorrectly classified point

test_point_index = 100
no_feature = 500
predicted_cls = sig_clf.predict(test_x_onehotCoding[test_point_index])
print("Predicted Class :", predicted_cls[0])
print("Predicted Class Probabilities:", np.round(sig_clf.predict_proba(test_x_onehotCoding[test_point_index]),4))
print("Actual Class :", test_y[test_point_index])
indices = np.argsort(-1*abs(clf.coef_))[predicted_cls-1][:,:no_feature]
print("-"*50)
get_impfeature_names(indices[0], test_df['TEXT'].iloc[test_point_index],test_df['Gene'].iloc[test_point_index],test_df['Variation'].iloc[test_point_index], no_feature)
Predicted Class : 7
Predicted Class Probabilities: [0.035  0.1263 0.0046 0.0505 0.0199 0.0145 0.7402 0.0037 0.0055]
Actual Class : 7
--------------------------------------------------
Out of the top  500  features  0 are present in query point

Random Forest Classifier

Hyper parameter tuning (With One hot Encoding)

train_x_onehotCoding = train_x_onehotCoding.toarray()
cv_x_onehotCoding = cv_x_onehotCoding.toarray()
test_x_onehotCoding = test_x_onehotCoding.toarray()

alpha = [100,200,500,1000,2000]
max_depth = [5, 10]
cv_log_error_array = []
for i in alpha:
    for j in max_depth:
        print("for n_estimators =", i,"and max depth = ", j)
        clf = RandomForestClassifier(n_estimators=i, criterion='gini', max_depth=j, random_state=42, n_jobs=-1)
        clf.fit(train_x_onehotCoding, train_y)
        sig_clf = CalibratedClassifierCV(clf, method="sigmoid")
        sig_clf.fit(train_x_onehotCoding, train_y)
        sig_clf_probs = sig_clf.predict_proba(cv_x_onehotCoding)
        cv_log_error_array.append(log_loss(cv_y, sig_clf_probs, labels=clf.classes_, eps=1e-15))
        print("Log Loss :",log_loss(cv_y, sig_clf_probs)) 

'''fig, ax = plt.subplots()
features = np.dot(np.array(alpha)[:,None],np.array(max_depth)[None]).ravel()
ax.plot(features, cv_log_error_array,c='g')
for i, txt in enumerate(np.round(cv_log_error_array,3)):
    ax.annotate((alpha[int(i/2)],max_depth[int(i%2)],str(txt)), (features[i],cv_log_error_array[i]))
plt.grid()
plt.title("Cross Validation Error for each alpha")
plt.xlabel("Alpha i's")
plt.ylabel("Error measure")
plt.show()
'''

best_alpha = np.argmin(cv_log_error_array)
clf = RandomForestClassifier(n_estimators=alpha[int(best_alpha/2)], criterion='gini', max_depth=max_depth[int(best_alpha%2)], random_state=42, n_jobs=-1)
clf.fit(train_x_onehotCoding, train_y)
sig_clf = CalibratedClassifierCV(clf, method="sigmoid")
sig_clf.fit(train_x_onehotCoding, train_y)

predict_y = sig_clf.predict_proba(train_x_onehotCoding)
print('For values of best estimator = ', alpha[int(best_alpha/2)], "The train log loss is:",log_loss(y_train, predict_y, labels=clf.classes_, eps=1e-15))
predict_y = sig_clf.predict_proba(cv_x_onehotCoding)
print('For values of best estimator = ', alpha[int(best_alpha/2)], "The cross validation log loss is:",log_loss(y_cv, predict_y, labels=clf.classes_, eps=1e-15))
predict_y = sig_clf.predict_proba(test_x_onehotCoding)
print('For values of best estimator = ', alpha[int(best_alpha/2)], "The test log loss is:",log_loss(y_test, predict_y, labels=clf.classes_, eps=1e-15))
for n_estimators = 100 and max depth =  5
Log Loss : 1.2202756659342509
for n_estimators = 100 and max depth =  10
Log Loss : 1.159525801135362
for n_estimators = 200 and max depth =  5
Log Loss : 1.2190339692179335
for n_estimators = 200 and max depth =  10
Log Loss : 1.1575317490127754
for n_estimators = 500 and max depth =  5
Log Loss : 1.2133198177641868
for n_estimators = 500 and max depth =  10
Log Loss : 1.1525485271317322
for n_estimators = 1000 and max depth =  5
Log Loss : 1.2117340996922596
for n_estimators = 1000 and max depth =  10
Log Loss : 1.151182438108908
for n_estimators = 2000 and max depth =  5
Log Loss : 1.2098738302870609
for n_estimators = 2000 and max depth =  10
Log Loss : 1.1505321184953212
For values of best estimator =  2000 The train log loss is: 0.6727539823286758
For values of best estimator =  2000 The cross validation log loss is: 1.1505321184953212
For values of best estimator =  2000 The test log loss is: 1.1632614990596637

Testing model with best hyper parameters (One Hot Encoding)

clf = RandomForestClassifier(n_estimators=alpha[int(best_alpha/2)], criterion='gini', max_depth=max_depth[int(best_alpha%2)], random_state=42, n_jobs=-1)
predict_and_plot_confusion_matrix(train_x_onehotCoding, train_y,cv_x_onehotCoding,cv_y, clf)
Log loss : 1.1505321184953212
Number of mis-classified points : 0.35150375939849626

Feature Importance Correctly Classified point

# test_point_index = 10
clf = RandomForestClassifier(n_estimators=alpha[int(best_alpha/2)], criterion='gini', max_depth=max_depth[int(best_alpha%2)], random_state=42, n_jobs=-1)
clf.fit(train_x_onehotCoding, train_y)
sig_clf = CalibratedClassifierCV(clf, method="sigmoid")
sig_clf.fit(train_x_onehotCoding, train_y)

test_point_index = 1
no_feature = 100
# after .toarray() above, a single row is 1-D, so reshape it to (1, -1) before predicting
predicted_cls = sig_clf.predict(test_x_onehotCoding[test_point_index].reshape(1, -1))
print("Predicted Class :", predicted_cls[0])
print("Predicted Class Probabilities:", np.round(sig_clf.predict_proba(test_x_onehotCoding[test_point_index].reshape(1, -1)),4))
print("Actual Class :", test_y[test_point_index])
indices = np.argsort(-clf.feature_importances_)
print("-"*50)
get_impfeature_names(indices[:no_feature], test_df['TEXT'].iloc[test_point_index],test_df['Gene'].iloc[test_point_index],test_df['Variation'].iloc[test_point_index], no_feature)

Feature Importance Incorrectly Classified point

test_point_index = 100
no_feature = 100
predicted_cls = sig_clf.predict(test_x_onehotCoding[test_point_index].reshape(1, -1))
print("Predicted Class :", predicted_cls[0])
print("Predicted Class Probabilities:", np.round(sig_clf.predict_proba(test_x_onehotCoding[test_point_index].reshape(1, -1)),4))
print("Actual Class :", test_y[test_point_index])
indices = np.argsort(-clf.feature_importances_)
print("-"*50)
get_impfeature_names(indices[:no_feature], test_df['TEXT'].iloc[test_point_index],test_df['Gene'].iloc[test_point_index],test_df['Variation'].iloc[test_point_index], no_feature)

Hyper parameter tuning (With Response Coding)

alpha = [10,50,100,200,500,1000]
max_depth = [2,3,5,10]
cv_log_error_array = []
for i in alpha:
    for j in max_depth:
        print("for n_estimators =", i,"and max depth = ", j)
        clf = RandomForestClassifier(n_estimators=i, criterion='gini', max_depth=j, random_state=42, n_jobs=-1)
        clf.fit(train_x_responseCoding, train_y)
        sig_clf = CalibratedClassifierCV(clf, method="sigmoid")
        sig_clf.fit(train_x_responseCoding, train_y)
        sig_clf_probs = sig_clf.predict_proba(cv_x_responseCoding)
        cv_log_error_array.append(log_loss(cv_y, sig_clf_probs, labels=clf.classes_, eps=1e-15))
        print("Log Loss :",log_loss(cv_y, sig_clf_probs)) 
'''
fig, ax = plt.subplots()
features = np.dot(np.array(alpha)[:,None],np.array(max_depth)[None]).ravel()
ax.plot(features, cv_log_error_array,c='g')
for i, txt in enumerate(np.round(cv_log_error_array,3)):
    ax.annotate((alpha[int(i/4)],max_depth[int(i%4)],str(txt)), (features[i],cv_log_error_array[i]))
plt.grid()
plt.title("Cross Validation Error for each alpha")
plt.xlabel("Alpha i's")
plt.ylabel("Error measure")
plt.show()
'''

best_alpha = np.argmin(cv_log_error_array)
clf = RandomForestClassifier(n_estimators=alpha[int(best_alpha/4)], criterion='gini', max_depth=max_depth[int(best_alpha%4)], random_state=42, n_jobs=-1)
clf.fit(train_x_responseCoding, train_y)
sig_clf = CalibratedClassifierCV(clf, method="sigmoid")
sig_clf.fit(train_x_responseCoding, train_y)

predict_y = sig_clf.predict_proba(train_x_responseCoding)
print('For values of best alpha = ', alpha[int(best_alpha/4)], "The train log loss is:",log_loss(y_train, predict_y, labels=clf.classes_, eps=1e-15))
predict_y = sig_clf.predict_proba(cv_x_responseCoding)
print('For values of best alpha = ', alpha[int(best_alpha/4)], "The cross validation log loss is:",log_loss(y_cv, predict_y, labels=clf.classes_, eps=1e-15))
predict_y = sig_clf.predict_proba(test_x_responseCoding)
print('For values of best alpha = ', alpha[int(best_alpha/4)], "The test log loss is:",log_loss(y_test, predict_y, labels=clf.classes_, eps=1e-15))

Testing model with best hyper parameters (Response Coding)

clf = RandomForestClassifier(max_depth=max_depth[int(best_alpha%4)], n_estimators=alpha[int(best_alpha/4)], criterion='gini', max_features='auto',random_state=42)
predict_and_plot_confusion_matrix(train_x_responseCoding, train_y,cv_x_responseCoding,cv_y, clf)

Feature Importance Correctly Classified point

clf = RandomForestClassifier(n_estimators=alpha[int(best_alpha/4)], criterion='gini', max_depth=max_depth[int(best_alpha%4)], random_state=42, n_jobs=-1)
clf.fit(train_x_responseCoding, train_y)
sig_clf = CalibratedClassifierCV(clf, method="sigmoid")
sig_clf.fit(train_x_responseCoding, train_y)


test_point_index = 1
no_feature = 27
predicted_cls = sig_clf.predict(test_x_responseCoding[test_point_index].reshape(1,-1))
print("Predicted Class :", predicted_cls[0])
print("Predicted Class Probabilities:", np.round(sig_clf.predict_proba(test_x_responseCoding[test_point_index].reshape(1,-1)),4))
print("Actual Class :", test_y[test_point_index])
indices = np.argsort(-clf.feature_importances_)
print("-"*50)
for i in indices:
    if i<9:
        print("Gene is important feature")
    elif i<18:
        print("Variation is important feature")
    else:
        print("Text is important feature")

Feature Importance Incorrectly Classified point

test_point_index = 100
predicted_cls = sig_clf.predict(test_x_responseCoding[test_point_index].reshape(1,-1))
print("Predicted Class :", predicted_cls[0])
print("Predicted Class Probabilities:", np.round(sig_clf.predict_proba(test_x_responseCoding[test_point_index].reshape(1,-1)),4))
print("Actual Class :", test_y[test_point_index])
indices = np.argsort(-clf.feature_importances_)
print("-"*50)
for i in indices:
    if i<9:
        print("Gene is important feature")
    elif i<18:
        print("Variation is important feature")
    else:
        print("Text is important feature")

Future Work

  • Apply all the models with tf-idf features (replace CountVectorizer with TfidfVectorizer and rerun the same cells); a starting sketch is given below
  • Instead of using all the words in the dataset, use only the top 1000 words based on tf-idf values
  • Apply Logistic Regression with CountVectorizer features, including both unigrams and bigrams
  • Try any of the feature engineering techniques discussed in the course to reduce the CV and test log loss to a value below 1.0
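
A minimal starting sketch for the first three items (assuming the same train_df / cv_df splits and y_train / y_cv label arrays used in the cells above; the exact wiring is an assumption, not part of the original pipeline):

# Hedged sketch: tf-idf text features restricted to the top 1000 terms, with unigrams
# and bigrams, fed to the same calibrated SGD logistic-regression setup used earlier.
# Assumes train_df/cv_df with a preprocessed TEXT column and y_train/y_cv label arrays.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.calibration import CalibratedClassifierCV
from sklearn.metrics import log_loss

text_vectorizer = TfidfVectorizer(min_df=3, max_features=1000, ngram_range=(1, 2))
train_text_tfidf = text_vectorizer.fit_transform(train_df['TEXT'])
cv_text_tfidf    = text_vectorizer.transform(cv_df['TEXT'])   # cv_df assumed to exist alongside train_df/test_df

clf = SGDClassifier(class_weight='balanced', alpha=0.001, penalty='l2', loss='log', random_state=42)
clf.fit(train_text_tfidf, y_train)
sig_clf = CalibratedClassifierCV(clf, method="sigmoid")
sig_clf.fit(train_text_tfidf, y_train)
print("CV log loss with tf-idf text features:", log_loss(y_cv, sig_clf.predict_proba(cv_text_tfidf)))

This only swaps the text block; the gene and variation one-hot features from the earlier cells could be stacked back alongside the tf-idf matrix in the same way as before fitting.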